Stata to Python Equivalents

Special thanks to John Coglianese for feedback and for supplying the list of "vital" Stata commands. Feedback and requests for additions to the list are always welcome!

The official Pandas documentation includes a "Comparison with Stata" page which is another great resource.

1   Intro/Note on Notation

Coding in Python is a little different than coding in Stata.

In Stata, you have one dataset in memory. The dataset is a matrix where each column is a "variable" with a unique name and each row has a number (the special variable _n). Everything in Stata is built around this paradigm.

Python is a general purpose programming language where a "variable" is not a column of data. Variables can be anything, a single number, a matrix, a list, a string, etc. The Pandas package implements a kind of variable called a DataFrame that acts a lot like the single dataset in Stata. It is a matrix where each column and each row has a name. The key distinction in Python is that a DataFrame is itself a variable and you can work with any number of DataFrames at one time. You can think of each column in a DataFrame as a variable just like in Stata, except that when you reference a column, you also have to specify the DataFrame.

The Stata-to-Python translations below are written assuming that you have a single DataFrame called df. Placeholders like <varname> and <dtafile> show where user-specified values go in each language. Note that in many cases, <varname> will be simple text in Stata (e.g., avg_income) while in Python it will be a string ('avg_income'). If you were to write df[avg_income] without quotes, Python would go looking for a variable--a list, a number, a string--that's been defined somewhere else. Because of this, <varlist> in Python represents a list of variable names: ['educ', 'income', 2017].
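To see the quotes in action, here's a minimal sketch (the column names are made up for illustration):

```python
import pandas as pd

# Each column is like a Stata variable, but lives inside the DataFrame
df = pd.DataFrame({'educ': [12, 16, 14],
                   'avg_income': [35000, 60000, 42000]})

# The column name is a string -- the quotes are required
mean_income = df['avg_income'].mean()
```

Writing `df[avg_income]` without quotes would raise a NameError unless a Python variable named avg_income happened to be defined elsewhere.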

2   Note on composability

One concept that will push your Python skills forward quickly is "composability." That is, you can combine base-level commands to "create" whole new commands. This is a little more difficult in Stata where each line of code acts on an entire dataset.

For example, let's say you know two commands in Python/pandas. First, df[<condition>], which returns the rows of DataFrame df for which the Boolean <condition> is True. This is like keep if <condition> in Stata, except df[<condition>] is itself a DataFrame that can be acted upon independently of df itself. This means that you now know the general if <condition> syntax for every other pandas command.

So let's say the second command you know is df.describe(), which is the equivalent of Stata's summarize. This command doesn't have its own dedicated if syntax. It doesn't need one. All you need is df[<condition>].describe().
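For instance, composing the row filter with .describe() on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'age': [1, 5, 10]})

# No dedicated `if` syntax needed: filter first, then describe the result
stats = df[df['age'] > 2].describe()
```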

In this way, you can create (i.e., "compose") new commands. You may notice that sometimes Python/pandas will require composed commands where Stata uses a one-liner (for example, see Stata's drop varstem* below). While it may be annoying to type a few more characters, this is actually a good thing! It means that there are fewer base commands to learn in pandas. It's like learning an alphabet with 26 letters and composing millions of words instead of learning millions of individual hieroglyphs.

This also means that sometimes there is more than one way to do things. Sometimes, it doesn't matter which way you do it. Other times, one way of doing things will execute more quickly. You can use the %timeit magic command in IPython to easily test which one is faster. But this will usually only be an issue with larger data sets (usually several gigabytes). For example, in large groupby operations, .shift(fill_value=0) is much, much slower than .shift().fillna(0) even though the end result is the same.
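Outside IPython, the standard-library timeit module plays the role of %timeit. A minimal sketch (the DataFrame here is a toy, so the timing is only illustrative):

```python
import timeit

import numpy as np
import pandas as pd

# Toy grouped data
df = pd.DataFrame({'g': np.repeat(np.arange(100), 10),
                   'x': np.arange(1000.0)})

# Time one way of computing a zero-filled group-wise lag
t = timeit.timeit(lambda: df.groupby('g')['x'].shift().fillna(0), number=5)
```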

3   Input/Output

Stata Python
log using <file> Python doesn't display results automatically like Stata. You have to explicitly call the print function. Using a Jupyter notebook is the closest equivalent.
help <command>
  1. help(<command>) OR
  2. <command>? in IPython (as in pd.read_stata?)
cd some/other/directory
import os
os.chdir('some/other/directory')
but this is bad practice. Better practice is to use full pathnames whenever possible.
use my_file
import pandas as pd
df = pd.read_stata('my_file.dta')
use var1 var2 using my_file df = pd.read_stata('my_file.dta', columns=['var1', 'var2'])
import excel using <excelfile> df = pd.read_excel('<excelfile>')
import delimited using my_file.csv df = pd.read_csv('my_file.csv')
save my_file, replace
  1. df.to_stata('my_file.dta') OR
  2. df.to_pickle('my_file.pkl') for Python-native file type.
outsheet using my_file.csv, comma df.to_csv('my_file.csv')
export excel using <excel_name> df.to_excel('<excel_name>')
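A quick round trip shows the to_csv/read_csv pair from the table (an in-memory buffer stands in for my_file.csv):

```python
import io

import pandas as pd

df = pd.DataFrame({'var1': [1, 2], 'var2': ['a', 'b']})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # like: outsheet using my_file.csv, comma
buffer.seek(0)
df2 = pd.read_csv(buffer)        # like: import delimited using my_file.csv
```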

4   Sample Selection

Stata Python
keep if <condition> df = df[<condition>]
keep if a > 7 df = df[df['a'] > 7]
drop if <condition> df = df[~(<condition>)] where ~ is the logical negation operator in pandas and numpy (and bitwise negation for Python more generally).
keep if _n == 1
  1. df = df.head(1) OR
  2. df = df.iloc[0, :] Python is a 0-indexed language, so when counting the elements of lists and arrays, you start with 0 instead of 1.
keep if _n == _N
  1. df = df.tail(1) OR
  2. df = df.iloc[-1, :]
keep if _n == 7 df = df.iloc[6, :] (Remember to count from 0)
keep if _n <= 10 df = df.iloc[:10, :] (Remember to count from 0; slices exclude the endpoint, so :10 keeps rows 0 through 9)
keep var df = df['var']
keep var1 var2 df = df[['var1', 'var2']]
keep varstem* df = df.filter(like='varstem')
drop var
  1. del df['var'] OR
  2. df = df.drop('var', axis=1)
drop var1 var2 df = df.drop(['var1', 'var2'], axis=1)
drop varstem* df = df.drop(df.filter(like='varstem').columns, axis=1) (filter's like matches a substring, so no * is needed)
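The drop varstem* row is a good example of composition: filter finds the columns and drop removes them. A minimal sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({'income_2016': [1], 'income_2017': [2], 'educ': [16]})

# drop income* : find the matching columns with filter, then drop them
df = df.drop(df.filter(like='income').columns, axis=1)
```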

5   Data Info and Summary Statistics

Stata Python
describe
  1. df.info() OR
  2. df.dtypes just to get data types. Note that Python does not have value labels like Stata does.
describe var df['var'].dtype
count
  1. df.shape[0] OR
  2. len(df). Here df.shape returns a tuple with the length and width of the DataFrame.
count if <condition>
  1. df[<condition>].shape[0] OR
  2. (<condition>).sum() if the condition involves a DataFrame, e.g., (df['age'] > 2).sum()
summ var df['var'].describe()
summ var if <condition>
  1. df[<condition>]['var'].describe() OR
  2. df.loc[<condition>, 'var'].describe()
summ var [aw = <weight>] Right now you have to calculate weighted summary stats manually. There are also some tools available in the Statsmodels package.
summ var, d df['var'].describe() plus df['var'].quantile([.1, .25, .5, .75, .9]) or whatever other statistics you want.
tab var df['var'].value_counts()
tab var1 var2
  1. pd.crosstab(df['var1'], df['var2']) OR

  2. df.groupby(['var1', 'var2']).size().unstack('var2')

    Note that the .unstack('var2') here is purely cosmetic: it transforms the data from "long" to "wide" which is how crosstabs are usually presented.

tab <var1> <var2>, summarize(<var3>)
df.groupby(['var1', 'var2'])['var3'].mean().unstack('var2')
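Both versions of tab var1 var2 in action, on toy data:

```python
import pandas as pd

df = pd.DataFrame({'var1': ['a', 'a', 'b', 'b'],
                   'var2': [0, 1, 0, 0]})

# tab var1 var2, two equivalent ways
xtab = pd.crosstab(df['var1'], df['var2'])
counts = df.groupby(['var1', 'var2']).size().unstack('var2')
```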

6   Variable Manipulation

Stata Python
gen newvar = <expression> df['newvar'] = <expression>
gen newvar = oldvar + 7 df['newvar'] = df['oldvar'] + 7
gen newvar = <expression> if <condition> df.loc[<condition>, 'newvar'] = <expression>. As with Stata, the rows of df that don't meet the condition will be missing (numpy.nan).
replace var = <expression> if <condition> df.loc[<condition>, 'var'] = <expression>
rename var newvar df = df.rename(columns={'var': 'newvar'}). You can also directly manipulate df.columns like a list: df.columns = ['a', 'b', 'c'].
inlist(var, <val1>, <val2>) df['var'].isin((<val1>, <val2>))
inrange(var, <val1>, <val2>) df['var'].between(<val1>, <val2>)
subinstr(<str>, " ", "_", .) df['var'].str.replace(' ', '_')
egen newvar = count(var) newvar = df['var'].notnull().sum(). NOTE: For these egen commands, newvar is a full (constant) column in Stata, while it is a scalar in Python. If you want a full constant column on your DataFrame, you can do df['newvar'] = 7 or whatever the constant is.
egen <newvar> = max(var) <newvar> = df['var'].max()
egen <newvar> = mean(var) <newvar> = df['var'].mean()
egen <newvar> = total(var) <newvar> = df['var'].sum()
egen <newvar> = group(var1 var2) <newvar> = econtools.group_id(df, cols=['var1', 'var2']) Please see the documentation for group_id.
egen newvar = <stat>(var), by(groupvar1 groupvar2) df['newvar'] = df.groupby(['groupvar1', 'groupvar2'])['var'].transform('<stat>').
collapse (sd) var (median) var ///
(max) var (min) var, ///
by(groupvar1 groupvar2)
df.groupby(['groupvar1', 'groupvar2'])['var'].agg(['std', 'median', 'max', 'min'])
collapse (<stat>) var [iw = <weight>] Manually or maybe Statsmodels has a tool.
collapse (mean) var1 var2, by(groupvar1 groupvar2) df.groupby(['groupvar1', 'groupvar2'])[['var1', 'var2']].mean()
recode var (1/5 = 1) N/A, see note below.
recode var (1/5 = 1), gen(<newvar>) N/A.
label var var <label> N/A.
label define <labelname> 1 <valuelabel> N/A.
label values var <labelname> N/A.
label list <labelname> N/A.

Python doesn't have "labels" built into DataFrames like Stata does. However, you can use a dictionary to map data values to labels when necessary.

variable_labels = {
    1: "First Category",
    2: "Second Category",
    3: "Last Category",
}

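For example, you can apply such a mapping with Series.map:

```python
import pandas as pd

# Hypothetical value labels, applied with .map
variable_labels = {1: "First Category", 2: "Second Category", 3: "Last Category"}

df = pd.DataFrame({'var': [1, 3, 2]})
df['var_label'] = df['var'].map(variable_labels)
```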
7   Bysort

All bysort and egen commands will be compositions using groupby.agg or groupby.transform.

Stata Python
bys group_var1 group_var2: gen group_sum = sum(var)
  1. df['group_sum'] = df.groupby(['group_var1', 'group_var2'])['var'].transform('sum')
  2. df['group_sum'] = df.groupby(['group_var1', 'group_var2'])['var'].transform(np.sum) Note that transform can take a string name of common functions like sum, mean, etc., or an arbitrary function to be executed, including np.sum or a function of your own creation.
bys group_var1 group_var2 (sort_var): keep if _n==1
  1. df = (df
        .sort_values(['group_var1', 'group_var2', 'sort_var'])
        .drop_duplicates(['group_var1', 'group_var2'], keep='first'))
  2. df = (df
        .sort_values(['group_var1', 'group_var2', 'sort_var'])
        .groupby(['group_var1', 'group_var2'])
        .first())
bys group_var1 group_var2 (sort_var): keep if _n==j where j is some arbitrary number
df = (df
    .sort_values(['group_var1', 'group_var2', 'sort_var'])
    .groupby(['group_var1', 'group_var2'])
    .apply(lambda x: x.iloc[j, :]))
(Remember to count from 0: Stata's _n==j corresponds to .iloc[j - 1, :].)
bys group_var1 group_var2 (sort_var): gen jth_val = var[j] where j is some arbitrary number
df['jth_val'] = (df
    .sort_values(['group_var1', 'group_var2', 'sort_var'])
    .groupby(['group_var1', 'group_var2'])['var']
    .transform(lambda x: x.iloc[j]))

Note that groupby was also used above to accomplish a collapse. Like a Stata collapse, a pandas groupby aggregation reduces the size of the data set. If you instead want to execute what's called a window function in SQL, which does not reduce the shape of the original data set, you can call df.groupby(<vars>).transform(<function>) instead of df.groupby(<vars>).agg(<function>).

If for some reason transform doesn't work in your use case, you can also re-join the collapsed data back onto your original data set:

df = pd.read_csv('my_data.csv')
state_pop_sum = df.groupby('state')['population'].sum()
df = df.join(state_pop_sum.to_frame('state_pop_sum'),  # Give the new column a name besides 'population'
             on='state')
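Here's a self-contained toy comparison showing that transform and the collapse-then-join route produce the same column:

```python
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'CA', 'TX'],
                   'population': [10, 20, 5]})

# Window-function style: transform keeps all three rows
via_transform = df.groupby('state')['population'].transform('sum')

# Collapse-then-join gives the same column
state_pop_sum = df.groupby('state')['population'].sum()
via_join = df.join(state_pop_sum.to_frame('state_pop_sum'),
                   on='state')['state_pop_sum']
```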

8   Panel Data

There is no general equivalent to tsset in Python. However, you can accomplish most if not all of the same tasks using a DataFrame's index (the row's equivalent of columns). In Stata, the "DataFrame" in memory always has the observation row number, denoted by the Stata built-in variable _n. In Python and Pandas, a DataFrame index can be anything (though you can also refer to rows by the row number; see .loc vs .iloc). It can also be hierarchical with multiple levels. It is a much more general tool than tsset.

Stata Python
tsset panelvar timevar df = df.set_index(['panelvar', 'timevar'])
L.var df['var'].shift() NOTE: The index must be correctly sorted for shift to work the way you want it to. You will also probably need to use a groupby; see below.
L2.var df['var'].shift(2)
F.var df['var'].shift(-1)

8.1   Examples

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df0 = pd.DataFrame({'var1': np.arange(6),
   ...:                     'id': [1, 1, 2, 2, 3, 3],
   ...:                     'period': [0, 1] * 3})

In [4]: print(df0)
   var1  id  period
0     0   1       0
1     1   1       1
2     2   2       0
3     3   2       1
4     4   3       0
5     5   3       1

In [5]: df = df0.set_index(['id', 'period']).sort_index()

In [6]: print(df)
id period
1  0          0
   1          1
2  0          2
   1          3
3  0          4
   1          5

In [7]: df['var1_lag'] = df.groupby(level='id')['var1'].shift()

In [8]: print(df)
           var1  var1_lag
id period
1  0          0       NaN
   1          1       0.0
2  0          2       NaN
   1          3       2.0
3  0          4       NaN
   1          5       4.0

In [9]: df['var1_for'] = df.groupby(level='id')['var1'].shift(-1)

In [10]: print(df)
           var1  var1_lag  var1_for
id period
1  0          0       NaN       1.0
   1          1       0.0       NaN
2  0          2       NaN       3.0
   1          3       2.0       NaN
3  0          4       NaN       5.0
   1          5       4.0       NaN

9   Merging and Joining

NOTE: Merging in Python is like R, SQL, etc. See additional explanation below.

Stata Python
append using <filename> df_joint = pd.concat([df1, df2]) (the older df1.append(df2) was removed in pandas 2.0)
merge 1:1 <vars> using <filename>
  1. df_joint = df1.join(df2) if <vars> are the DataFrames' indexes, or
  2. df_joint = pd.merge(df1, df2, on=<vars>) otherwise. Beware that pd.merge will not keep the index of either DataFrame.

Merging with Pandas DataFrames does not require you to specify "many-to-one" or "one-to-many". Pandas will figure that out based on whether the variables you're merging on are unique or not. However, you can specify what sub-sample of the merge to keep using the keyword argument how, e.g., df_joint = df1.join(df2, how='left') is the default for join while how='inner' is the default for pd.merge.

Pandas how Stata , keep() Intuition
how='left' keep(1, 3) Keeps all observations in the "left" DataFrame.
how='right' keep(2, 3) Keeps all observations in the "right" DataFrame.
how='inner' keep(3) Keeps observations that are in both DataFrames.
how='outer' keep(1 2 3) Keeps all observations.
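One handy extra: pd.merge's indicator=True adds a _merge column, much like Stata's _merge variable. A toy sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'x': ['a', 'b']})
df2 = pd.DataFrame({'id': [2, 3], 'y': ['c', 'd']})

# indicator=True labels each row 'left_only', 'right_only', or 'both'
df_joint = pd.merge(df1, df2, on='id', how='outer', indicator=True)
```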

10   Reshape

Like with merging, reshaping a DataFrame in Python is a bit different because of the paradigm shift from the "only one data table in memory" model of Stata to "a data table is just another object/variable" of Python. But this difference also makes reshaping a little easier in Python.

The most fundamental reshape commands in Python/pandas are stack and unstack:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: long = pd.DataFrame(np.arange(8),
   ...:                     columns=['some_variable'],
   ...:                     index=pd.MultiIndex.from_tuples(
   ...:                         [('a', 1), ('a', 2),
   ...:                          ('b', 1), ('b', 2),
   ...:                          ('c', 1), ('c', 2),
   ...:                          ('d', 1), ('d', 2)]))

In [4]: long.index.names=['unit_id', 'time']

In [5]: long.columns.name = 'varname'

In [6]: long
varname       some_variable
unit_id time
a       1                 0
        2                 1
b       1                 2
        2                 3
c       1                 4
        2                 5
d       1                 6
        2                 7

In [7]: wide = long.unstack('time')

In [8]: wide
varname some_variable
time                1  2
a                   0  1
b                   2  3
c                   4  5
d                   6  7

In [9]: long2 = wide.stack('time')

In [10]: long2
varname       some_variable
unit_id time
a       1                 0
        2                 1
b       1                 2
        2                 3
c       1                 4
        2                 5
d       1                 6
        2                 7

Here Input 3 creates a DataFrame, Input 4 gives each of the index columns a name, and Input 5 names the columns. Coming from Stata, it's a little weird to think of the column names themselves having a "name", but the column names are just an index like the row names are. It starts to make more sense when you realize columns don't have to be strings. They can be integers, like years or FIPS codes. In those cases, it makes a lot of sense to give the columns a name so you know what you're dealing with.

Input 7 does the reshaping using unstack('time'), which takes the index 'time' and creates a new column for every unique value it has. Notice that the columns now have multiple levels, just like the index previously did. This is another good reason to label your index and columns. If you want to access either of those columns, you can do so as usual, using a tuple to differentiate between the two levels:

In [11]: wide[('some_variable', 1)]
a    0
b    2
c    4
d    6
Name: (some_variable, 1), dtype: int32

If you want to combine the two levels (like Stata defaults to), you can simply rename the columns:

In [13]: wide_single_level_column = wide.copy()

In [14]: wide_single_level_column.columns = [
    ...:        '{}_{}'.format(var, time)
    ...:        for var, time in wide_single_level_column.columns]

In [15]: wide_single_level_column
                  some_variable_1  some_variable_2
         a                      0                1
         b                      2                3
         c                      4                5
         d                      6                7

The pivot command can also be useful, but it's a bit more complicated than stack and unstack, so it's better to revisit pivot after you are comfortable working with DataFrame indexes and columns.

Stata Python
reshape <wide/long> <stubs>, i(<i_vars>) j(<j_var>)
wide: df.unstack('<j_var>')
long: df.stack('<j_var>')
see also df.pivot
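pandas also ships wide_to_long, which mirrors Stata's reshape long <stubs>, i() j() directly. A minimal sketch with made-up column names:

```python
import pandas as pd

wide = pd.DataFrame({'id': [1, 2],
                     'inc2016': [10, 30],
                     'inc2017': [20, 40]})

# reshape long inc, i(id) j(year)
long = pd.wide_to_long(wide, stubnames='inc', i='id', j='year')
```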

11   Econometrics

Stata Python
ttest var1, by(var2)
from scipy.stats import ttest_ind
ttest_ind(array1, array2)
xi: i.var pd.get_dummies(df['var'])
i.var2#c.var1 pd.get_dummies(df['var2']).multiply(df['var1'], axis=0)
reg yvar xvar if <condition>, r
import econtools.metrics as mt
results = mt.reg(df[<condition>], 'yvar', 'xvar', robust=True)
reg yvar xvar if <condition>, vce(cluster cluster_var) results = mt.reg(df[<condition>], 'yvar', 'xvar', cluster='cluster_var')
areg yvar xvar1 xvar2, absorb(fe_var) results = mt.reg(df, 'yvar', ['xvar1', 'xvar2'], fe_name='fe_var')
predict newvar, resid newvar = results.resid
predict newvar, xb newvar = results.yhat
_b[var], _se[var] results.beta['var'], results.se['var']
test var1 var2 results.Ftest(['var1', 'var2'])
test var1 var2, equal results.Ftest(['var1', 'var2'], equal=True)
lincom var1 + var2 econtools.metrics.f_test with appropriate parameters.
ivreg2 econtools.metrics.ivreg
outreg2 econtools.outreg
reghdfe None (hoping to add it to Econtools soon).
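The dummy and interaction rows above can be checked on toy data (column names invented):

```python
import pandas as pd

df = pd.DataFrame({'var1': [1.0, 2.0, 3.0],
                   'var2': ['a', 'b', 'a']})

# xi: i.var2 -> one dummy column per category of var2
dummies = pd.get_dummies(df['var2'])

# i.var2#c.var1 -> each dummy times the continuous variable, row by row
interactions = dummies.multiply(df['var1'], axis=0)
```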

12   Plotting

Visualizations are best handled by the packages Matplotlib and Seaborn.

Stata Python
binscatter econtools.binscatter
maptile No quick tool, but easy to do with Cartopy.
coefplot ax.scatter(results.beta.index, results.beta) often works. Depends on context.
twoway scatter y_var x_var df.plot.scatter('x_var', 'y_var')
twoway scatter y_var x_var if <condition> df[<condition>].plot.scatter('x_var', 'y_var')
twoway <connected/line/area/bar/rarea> As above with appropriate parameters passed to df.plot function. However, it is better to use matplotlib or seaborn directly and call ax.plot(df['var1'], df['var2']). Like with merge, it's a different paradigm that needs more explanation.

13   Other differences

13.1   Missing values

In Python, missing values are represented by a NumPy "not a number" object, np.nan. In Stata, missing (.) is larger than every number, so 10 < . yields True. In Python, np.nan is never equal to anything. Any comparison involving np.nan is always False, even np.nan == np.nan.

To look for missing values in DataFrame columns, use any of the following.

  • df[<varname>].isnull() returns a vector of True and False values for each row of df[<varname>].
  • df[<varname>].notnull() is the complement of .isnull().
  • The function np.isnan(<arraylike>) takes an array and returns True or False for each element of the array (a DataFrame is a special type of array).
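A quick demonstration of both points (nan never equals anything; .isnull() flags the missing rows):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

nan_equals_itself = (np.nan == np.nan)   # False: nan is never equal to anything
missing = s.isnull()                     # True exactly where the data are missing
```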

Another important difference is that np.nan is a floating point data type, so any column of a DataFrame that contains missing numbers will be floats. If a column of integers gets changed so that even one element is np.nan, the whole column will be converted to floats.
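You can watch the promotion happen on a toy Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])        # integer column

# Making even one element missing promotes the whole column to float
s_with_nan = s.where(s != 2)
```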

13.2   Floating point equality

In Stata, decimal numbers often fail equality checks that look like they should pass: after gen x = .1 (stored as a float), the check x == .1 is False because the literal .1 is held as a double. In Python, 3.0 == 3 returns True, since only one floating point type is involved. However, you should be wary when doing large-scale equality checks using floats, like when merging two data sets using a variable/column that is a float. This is true for any language! (Read up on floating point error if you're curious why testing equality of floats is tricky.)
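A quick illustration, using np.isclose for a tolerance-based comparison:

```python
import numpy as np

a = 0.1 + 0.2

exact = (a == 0.3)           # False: accumulated floating point error
close = np.isclose(a, 0.3)   # True: equal within a small tolerance
```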