"Writing Code in Econ" Series
"Never memorize anything you can look up." ~ Albert Einstein, maybe
It's a good idea to keep your code organized. As projects get bigger, and as you step away from them and return, it's easier to get lost. A corollary to the supposed Einstein quote is "never memorize anything you can re-derive when you need it." It's easier to come back to code after a long time and go through logic like "a script like X would probably be in folder Y" than it is to read through every file and every folder searching for X.
The file and folder structure outlined here is one way to do that.
An important principle in coding is to only define something once, whether it's a variable or a function, and refer back to that singular definition everywhere else. No copying and pasting snippets of code! If you find an error in a sequence of code you've copied-and-pasted to 5 other files, you'd better hope you remember what those files are. This "do it once and never do it again" principle includes code that accesses data files, which is a lot of what we do.
To this end, we will write a separate (usually small) Python package for each major data source we use. This will make it easy to use the data again and again across different projects without copying and pasting.
The folders in such a project will generally look like this:
```
data-project-name/
|── setup.py
|── data_project_name/
    |── __init__.py
    |── util/
    |   |── __init__.py
    |   |── env.py
    |── clean/
        |── __init__.py
        |── raw.py
```
data-project-name is a Git repository that contains everything else. The folder with the same name (substituting underscores for dashes) just underneath it is the Python package within the Git repository. They don't have to have the same name, but it makes things easier. The setup.py script is a boilerplate script used to install the Python package so that it can be referenced in any other Python script on the computer. The __init__.py files will usually be empty and are there to let Python know that the folder is part of a Python package.
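A minimal setup.py can be just a few lines. The name and version below are placeholders, not prescribed by this workflow; match the name to your package folder:

```python
# Boilerplate install script. Run `pip install -e .` from the repo root
# so the package is importable everywhere but edits are picked up
# without reinstalling.
from setuptools import setup, find_packages

setup(
    name='data_project_name',   # placeholder: use your package's name
    version='0.1.0',            # placeholder version
    packages=find_packages(),
)
```

Installing with the -e (editable) flag is what lets any other script on the computer import the package while you keep developing it.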
Notice that there's no data folder here. That's because the data files are (usually) not kept in the Python package or the Git repository. First, they often can't be, since raw data files can be several gigabytes, which would make hosting the repository online at Github or Bitbucket difficult. Second, they don't need to be, as we'll see with the data_path function below.
Since the data files often can't be stored in the repo, and never have to be stored there, it's easiest to be uniform across all projects: just put the data somewhere on your hard drive and point the rest of your code to that location (using the data_path function, explained below).
The folder where the data files are stored still has a set structure:
```
mass-data-folder-on-big-hard-drive/
|── data-project-name-data/
|   |── src/
|   |── clean_file_1.dta
|   |── clean_file_2.dta
|── other-data-project-name/
    |── src/
```
The key principle here is to preserve the source data for each project, which lives in src/. The rule is that raw data goes from the BLS website or wherever it came from straight into src/, and the raw files are never touched again. Never manually edit the raw CSV files. All cleaning is programmatic, which means you should be able to download the data fresh from the source and immediately run the code.
That way you know that anything not in
src/ was created by you and can
be recreated as long as you have the code and the source data.
The util/ folder is for scripts that will be used a lot within the data cleaning itself but not by any other Python code or projects. In fact, util/ will often just contain a single file, env.py. The util/env.py file contains environmental variables for the project, hence the name. These variables define where on the hard drive the raw data are stored, etc. A basic util/env.py file looks like this:
```python
import os

data_root = r'd:/'
DATA_PATH = os.path.join(data_root, 'Data', 'data-project-name')


def data_path(*args):
    return os.path.join(DATA_PATH, *args)


def src_path(*args):
    return data_path('src', *args)
```
The data_path and src_path functions take a file name as a string and append all the folder information to it, so all you need to worry about is the name of the actual file, not all the folders. The basic use of the functions looks like this:
```
In : from util.env import data_path

In : data_path('main_file.dta')
Out: 'd:\\Data\\data-project-name\\main_file.dta'
```
These functions are the canonical definitions of where the data files are found on the computer. All other scripts will refer to these definitions by importing them. For example, a function that cleans and saves a dataset might look like this:
```python
import pandas as pd

from util.env import src_path, data_path


def clean_gdp_data():
    # Read data from raw CSV file
    df = pd.read_csv(src_path('annual_gdp.csv'))
    # Fudge the numbers
    df['gdp'] = df['gdp'] * 2
    # Save to Stata DTA
    df.to_stata(data_path('annual_gdp.dta'))


if __name__ == '__main__':
    clean_gdp_data()
```
If you're working with big data files and have lots of people on your team, you can use Python's builtin socket library to write if-then statements that set the data_root variable depending on the name of the computer running the code.
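A sketch of that pattern, at the top of util/env.py; the hostnames here are made up for illustration, so substitute your team's actual machine names:

```python
import os
import socket

# Hypothetical machine names mapped to where each person keeps the data
_DATA_ROOTS = {
    'alice-desktop': r'd:/',
    'bob-macbook': '/Users/bob/',
}

# Fall back to a default location on unrecognized machines
data_root = _DATA_ROOTS.get(socket.gethostname(), r'd:/')
DATA_PATH = os.path.join(data_root, 'Data', 'data-project-name')
```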
The data_path and src_path functions should never be used in code outside data_project_name. Other projects will have their own data access functions.
The clean folder contains scripts that clean the raw data. Usually we'll call the barebones basic script that reads the source data raw.py, but sometimes that's all there is. If it's a very simple project, there may be a clean.py file instead of a folder.
Finally, after all the cleaning functions are written, we'll import them into the package-level __init__.py file like this:
```python
from clean.raw import load_data_1, load_data_2
# etc.
```
That way we don't have to worry about the internal file structure of data-project-name when we're using the data package in other projects.
Did data-project-name use a clean.py file or a clean/ folder? Did it use any other folders? If we import any externally facing functions into the project-level __init__.py, it doesn't matter. All we have to do in other projects is

```python
from data_project_name import load_data_1
```
This is the advantage of installing the package using setup.py: the data package is accessible to any other script. You don't have to remember the details of the cleaning code beyond the specific function you want, and for that you just have to look in a single file, the package-level __init__.py.
Full projects that form the basis of an academic paper are structured in a similar way.
```
project-name/
|── draft/    <- Is a Git repo
|── present/  <- Maybe a Git repo
|── lit/
|── data/
|   |── src/
|── out/
|   |── 1807/
|   |── 1808/
|   |   |── plot_variable.png
|   |   |── reg_main.tex
|── code/     <- Is a Git repo
    |── util/
    |   |── env.py
    |── clean/
    |── analysis/
    |── driver.sh
    |── summ_a_variable.py
    |── plot_a_variable.py
    |── reg_main.py
```
The data/ folder is for incidental data that is specific to this project. If it's data like CPI data or Census data that's likely to be used again and again, it should be in its own data-only project.
The draft/ folder is for drafts of the paper if/when we get that far, along with bib files and anything else that goes with the draft. Same for present/ (presentations) and lit/ (other papers from the literature we'll need to refer to).
The util/ folder is the same as above. The util/env.py file will also have an out_path function that defines where we want the output of analyses saved. This will usually just point to the project's out/ folder; however, we will often keep the out/ folder on Dropbox so we can always access our results. Actual figures and tables are saved in individual folders within out/ depending on the month the file was generated, e.g., 1808 for August 2018. Adding the month-year folder is handled automatically by out_path.
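One way to sketch that out_path function, assuming OUT_PATH is defined in env.py the same way DATA_PATH is:

```python
import os
from datetime import datetime

# Assumed layout: out/ sits inside the project folder on the big drive
OUT_PATH = os.path.join(r'd:/', 'Data', 'project-name', 'out')


def out_path(*args):
    # Month-year folder, e.g., '1808' for August 2018
    month = datetime.now().strftime('%y%m')
    return os.path.join(OUT_PATH, month, *args)
```

So out_path('reg_main.tex'), called in August 2018, would return a path ending in 1808/reg_main.tex without the calling script ever naming the month folder.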
Python makes it very easy to import functions from one file into another. One danger of this is that you can get circular imports, where Script A imports from Script B and Script B imports from Script A. Python will raise an error if this happens. This is surprisingly easy to do once you get nested imports, where A imports B, which imports C, and so on.
To avoid this, the code/ folder has a hierarchical structure:
- util/: useful utility functions that will be used a lot all over the package. Things like coordinate converters or ID generators. Scripts in util/ never import from other scripts in the project. That way you know that any other script can use the tools in util/. It's a universal donor to other scripts and never receives from them.
- clean/: for incidental data cleaning. Can import from util/ but that's it.
- analysis/: this is for prepping regression files and the like. Remember to define things once and only once. This goes for regression samples, too, and analysis/ is the place to put them. Can import from util/ and clean/.
- The root folder, code/: this is where we put scripts that create final output. Regressions, figures, summary stats, all here. These scripts can import from anywhere else in the project and they should never be imported from. If you write a function in reg_main.py that you want to use somewhere else, move it to analysis/ or util/ so it can be imported.
The final output scripts in code/ are prefixed by what they do: summ_ for summary stats, plot_ for plots, etc.
The last piece is driver.sh. In theory, this is a simple script that would create all the final tables and figures for our paper. In practice, the script is rarely run but serves as a shopping list of sorts to remind us of the command line options, etc., that we've settled on. A simple driver.sh looks like this:
```bash
#!/bin/bash
python reg_main.py --lag 3
python plot_a_variable.py --grayscale
```
Next in "Writing Code in Econ": Best Practices when Coding