# Econometrics Tools

## OLS

To estimate an OLS regression, pass the `reg()` function at least three
arguments:

- The DataFrame that contains the data.
- The name of the dependent variable as a string.
- The name(s) of the independent variable(s) as a string (for one variable) or as a list of strings.

Following these arguments, there are a number of keyword arguments for various
other options. For example, the following code estimates a basic wage
regression with state-level clustering and fixed effects, weighting by the
variable `sample_wt`.

```
import pandas as pd
import econtools.metrics as mt

# Load a data file with columns 'ln_wage', 'educ', 'age', 'male',
# 'state', and 'sample_wt'
df = pd.read_csv('my_data.csv')

y = 'ln_wage'
X = ['educ', 'age', 'male']
fe_var = 'state'
cluster_var = 'state'
weights_var = 'sample_wt'

results = mt.reg(
    df,                   # DataFrame
    y,                    # Dependent var (string)
    X,                    # Independent var(s) (string or list of strings)
    fe_name=fe_var,       # Fixed-effects/absorb var (string)
    cluster=cluster_var,  # Cluster var (string)
    awt_name=weights_var  # Sample weights
)
```

Note that `reg()` does *not* automatically estimate a constant term. In order
to have a constant/intercept in your model, you can (a) add a column of ones
to your DataFrame, or (b) use the `addcons` keyword arg:

```
results = mt.reg(
    df,
    y,
    X,            # Does not include a constant/intercept
    addcons=True  # Adds a constant term
)
```
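Option (a), adding the column of ones by hand, can be sketched with plain
pandas. The column names here are hypothetical, and `'cons'` is just an
arbitrary name for the added column:

```python
import pandas as pd

# Toy DataFrame with hypothetical columns
df = pd.DataFrame({'ln_wage': [1.0, 1.5, 2.0], 'educ': [12, 14, 16]})

# Manually add a column of ones to serve as the intercept,
# then include it in the regressor list instead of passing addcons=True
df['cons'] = 1
X = ['educ', 'cons']
```

Both approaches yield the same fit; `addcons=True` simply saves the manual step.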

## Instrumental Variables

Estimating an instrumental variables model is very similar, but is done using
the `ivreg()` function. The order of arguments is also slightly different in
order to differentiate between the instruments, endogenous regressors, and
exogenous regressors. Other keyword options, such as `addcons`, `cluster`, and
so forth, are exactly the same as with `reg()`.

One additional keyword argument is `method`, which sets the IV method used to
estimate the model. Currently supported values are `'2sls'` (the default) and
`'liml'`.

```
# <Imports and loading data>
y = 'wage'           # Dependent var
X = ['educ']         # Endogenous regressor(s)
Z = ['treatment']    # Instrumental variable(s)
W = ['age', 'male']  # Exogenous regressor(s)
results = mt.ivreg(df, y, X, Z, W)
```
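The default `'2sls'` method corresponds to the standard two-stage least
squares formula. As an illustration of the estimator itself (simulated data,
not `ivreg()`'s internals), it can be computed by hand with NumPy:

```python
import numpy as np

# Simulate data where x is endogenous (correlated with the error u)
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)                      # excluded instrument
w = rng.normal(size=n)                      # exogenous regressor
u = rng.normal(size=n)                      # structural error
x = 0.8 * z + 0.5 * u + rng.normal(size=n)  # endogenous regressor
y = 1.0 + 2.0 * x + 0.5 * w + u             # true betas: 2.0, 0.5, 1.0

X = np.column_stack([x, w, np.ones(n)])     # [endogenous, exogenous, const]
Z = np.column_stack([z, w, np.ones(n)])     # [instrument, exogenous, const]

# 2SLS: beta = (X' P_Z X)^{-1} X' P_Z y, with P_Z the projection onto Z
Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)
beta = np.linalg.solve(X.T @ Pz @ X, X.T @ Pz @ y)
```

With a reasonably strong instrument, `beta` recovers the structural
coefficients, while plain OLS of `y` on `X` would be biased by the
correlation between `x` and `u`.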

## Returned Results

The regression functions `reg()` and `ivreg()` return a custom `Results`
object that contains the beta estimates, the variance-covariance matrix, and
other relevant info.

The easiest way to see regression results is the `summary` attribute, but
direct access to the estimates is also possible.

```
import pandas as pd
import econtools.metrics as mt
df = pd.read_stata('some_data.dta')
results = mt.reg(df, 'ln_wage', ['educ', 'age'], addcons=True)
# Print a nice summary of the regression results (a string)
print(results)
# Print DataFrame w/ betas, se's, t-stats, etc.
print(results.summary)
# Print only betas
print(results.beta)
# Print std. err. for `educ` coefficient
print(results.se['educ'])
# Print full variance-covariance matrix
print(results.vce)
```

The full list of attributes is given in the API documentation for the
`Results` object.

### F tests

`econtools.metrics` contains two functions for conducting F tests.

The first, `Ftest()`, is for simple, Stata-like tests for joint significance
or equality. It is a method on the `Results` object.

```
results = mt.reg(df, 'ln_wage', ['educ', 'age'], addcons=True)
# Test for joint significance
F1, pF1 = results.Ftest(['educ', 'age'])
# Test for equality
F2, pF2 = results.Ftest(['educ', 'age'], equal=True)
```

The second, `f_test()`, is for F tests of arbitrary linear combinations of
coefficients. The tests are defined by an `R` matrix and an `r` vector such
that the null hypothesis is \(R\beta = r\).
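As an illustration of the statistic being computed (not of `f_test()`'s exact
interface), the Wald-type F statistic for \(R\beta = r\) can be written with
NumPy. Here a toy OLS fit tests the single restriction \(\beta_1 = \beta_2\):

```python
import numpy as np

# Toy OLS fit of y on [x1, x2, const], with equal true coefficients
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x1, x2, np.ones(n)])
y = 1.0 * x1 + 1.0 * x2 + 0.5 + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])
V = s2 * XtX_inv  # homoskedastic vcov estimate

# Null hypothesis R beta = r: here beta_1 - beta_2 = 0 (one restriction)
R = np.array([[1.0, -1.0, 0.0]])
r = np.array([0.0])

# F = (R b - r)' [R V R']^{-1} (R b - r) / q, q = number of restrictions
diff = R @ beta - r
F = (diff @ np.linalg.solve(R @ V @ R.T, diff)) / R.shape[0]
```

Under the null, `F` follows an F distribution with `q` and `n - k` degrees of
freedom.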

## Other Estimation Options

### Save memory by not computing predicted values

The `save_mem` flag can be used to reduce the memory footprint of the
`Results` object by not saving predicted values for the dependent variable
(`yhat`) and the residuals (`resid`), as well as the sample flag (`sample`).
Since these vectors are always size N (or bigger for `sample`), setting
`save_mem=True` can be very useful when running many regressions on large
samples.

### Check for collinear columns

The `check_colinear` flag can be used to check whether the list of regressors
contains any collinear variables. More technically, when `check_colinear` is
`True`, the regression function checks whether the regressor matrix X is full
rank. If it is not full rank, it figures out which columns are collinear and
prints the names of those columns to screen. It *does not* automatically drop
collinear columns.

Because these checks can be computationally expensive, `check_colinear`
defaults to `False`.
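The underlying full-rank check can be illustrated with NumPy. In this
hypothetical example, the third regressor is an exact linear combination of
the first two, so the regressor matrix is rank deficient:

```python
import numpy as np

# Hypothetical regressors where one column is a linear combination of others
rng = np.random.default_rng(2)
educ = rng.normal(size=100)
exper = rng.normal(size=100)
total = educ + exper  # exact linear combination of 'educ' and 'exper'

X = np.column_stack([educ, exper, total])

# A regression requires X to be full (column) rank; here it is not
rank = np.linalg.matrix_rank(X)
is_full_rank = rank == X.shape[1]
```

When the matrix is not full rank, as here, (X'X) is singular and the OLS
coefficients are not uniquely identified, which is why the flag reports the
offending columns rather than silently fitting.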

## Spatial HAC (Conley errors)

Spatial HAC standard errors (as in Conley (1999), Kelejian and Prucha (2007),
etc.) can be calculated by passing a dictionary with the relevant fields to
the `shac` keyword:

```
shac_params = {
    'x': 'longitude',  # Column in `df`
    'y': 'latitude',   # Column in `df`
    'kern': 'unif',    # Kernel name
    'band': 2,         # Kernel bandwidth
}
df = pd.read_stata('reg_data.dta')
results = mt.reg(df, 'lnp', ['sqft', 'rooms'],
                 fe_name='state',
                 shac=shac_params)
```

**Important:** The `band` parameter is assumed to be in the same units as `x`
and `y`. If `x` and `y` are degrees latitude/longitude, `band` should also be
in degrees. `econtools` does not do any advanced geographic distance
calculations here, just simple Euclidean distance.
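As a sketch of what a uniform kernel with a Euclidean bandwidth means (an
illustration only, not `econtools`' internal computation; whether the cutoff
is strict or inclusive is an assumption here), pairwise weights for three
hypothetical coordinate points can be computed like this:

```python
import numpy as np

# Hypothetical coordinates, both in degrees
x = np.array([0.0, 1.0, 3.0])  # longitude
y = np.array([0.0, 1.0, 0.0])  # latitude
band = 2.0                     # bandwidth, in the same units (degrees)

# Simple Euclidean distance between every pair of observations
dx = x[:, None] - x[None, :]
dy = y[:, None] - y[None, :]
dist = np.sqrt(dx ** 2 + dy ** 2)

# Uniform kernel: pairs within the bandwidth get weight 1, others 0
kernel = (dist < band).astype(float)
```

Here the first two points are about 1.41 degrees apart, so their covariance
term is retained, while the third point is more than 2 degrees from both and
contributes no cross terms.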