7  Estimation Methods

~6 hours · Standard Errors, Panel Data, Nonlinear Models · Intermediate-Advanced

Learning Objectives

  • Distinguish between identification and estimation
  • Choose appropriate standard errors for your data structure
  • Handle panel data with fixed effects efficiently
  • Implement nonlinear models (logit, probit, Poisson)
  • Understand maximum likelihood and GMM estimation

7.1 Identification vs. Estimation

A common source of confusion: identification and estimation are different problems.

Identification (Module 6)

Can we learn the causal effect from data?

  • Research design and assumptions
  • What variation identifies the effect?
  • Parallel trends, exclusion restriction, etc.
  • Whether your estimate is causal or just correlational

Estimation (This Module)

How do we compute the estimate from data?

  • OLS, MLE, GMM, nonparametric methods
  • Standard errors and inference
  • Handling large datasets efficiently
  • Correcting for clustering, heteroskedasticity

You can have correct identification but bad estimation (wrong standard errors), or good estimation but no identification (precise estimate of a biased number). Both matter.

7.2 Standard Errors and Inference

Choosing the right standard errors is essential for valid inference. Using incorrect standard errors can lead to dramatically wrong conclusions—confidence intervals that are too narrow (false precision) or too wide (lost power). The two main considerations are:

  • Heteroskedasticity: When the variance of the error term varies across observations (e.g., income variance differs by wealth level)
  • Clustering: When observations are correlated within groups (e.g., students in the same classroom, repeated observations of the same person)

Robust (Heteroskedasticity-Consistent) Standard Errors

Classical OLS standard errors assume homoskedasticity: constant error variance across all observations. In practice, this rarely holds. Robust standard errors (also called Huber-White or sandwich standard errors) remain valid under heteroskedasticity without requiring you to model its form.

What to look for in the code

The key parameter is cov_type='HC1' (Python), robust or vce(robust) (Stata), or vcov = vcovHC() (R). HC0-HC3 are variants with different small-sample corrections—HC1 is the most common. In the output, compare standard errors: robust SEs are often larger than classical OLS SEs when heteroskedasticity is present, leading to wider confidence intervals and more conservative inference.

How robust SEs work in Python

In Python's statsmodels, you request robust standard errors by passing the cov_type argument to the .fit() method. Without this argument, .fit() uses classical (homoskedastic) standard errors. The argument name cov_type stands for "covariance type" — it tells the estimator which variance-covariance matrix formula to use when computing standard errors. The value 'HC1' selects the Huber-White sandwich estimator with a small-sample correction factor of n/(n-k), matching Stata's ,robust option exactly. The key thing to remember: coefficients stay identical regardless of cov_type — only the standard errors change.

The four HC variants differ only in their small-sample corrections: HC0 applies no correction (biased downward in small samples), HC1 applies n/(n-k) (matches Stata's ,robust), HC2 adjusts by leverage (better when some observations are influential), and HC3 applies the strongest correction (jackknife-like, conservative). For most applied work, HC1 is the standard choice.
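
The differences among the variants are easiest to see side-by-side. The sketch below (simulated data; the variable names are illustrative, not from this module's dataset) fits one regression with heteroskedastic errors and reports the slope's standard error under each correction:

```python
# Sketch: compare HC0-HC3 standard errors on simulated heteroskedastic data
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
# Error variance grows with |x|, so the data are heteroskedastic by construction
y = 1 + 2 * x + rng.normal(scale=1 + np.abs(x), size=n)
df_sim = pd.DataFrame({'y': y, 'x': x})

fit = smf.ols('y ~ x', data=df_sim).fit()
for hc in ['HC0', 'HC1', 'HC2', 'HC3']:
    se_x = fit.get_robustcov_results(cov_type=hc).bse[1]  # SE on x
    print(f"{hc}: SE(x) = {se_x:.4f}")
```

The coefficient estimate is identical in every case; only the reported standard error moves, with HC0 the smallest and HC3 the most conservative.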

# Python: Robust Standard Errors
import statsmodels.formula.api as smf

# Default (homoskedastic) standard errors
model = smf.ols('outcome ~ treatment + controls', data=df).fit()

# Robust (heteroskedasticity-consistent) HC1
model_robust = smf.ols('outcome ~ treatment + controls', data=df).fit(
    cov_type='HC1'
)
print(model_robust.summary())
* Stata: Robust Standard Errors

* Default standard errors
reg outcome treatment controls

* Robust (heteroskedasticity-consistent) - ALWAYS use this
reg outcome treatment controls, robust

* Equivalent: vce(robust)
reg outcome treatment controls, vce(robust)
# R: Robust Standard Errors
library(sandwich)
library(lmtest)

# Fit model
model <- lm(outcome ~ treatment + controls, data = df)

# Robust standard errors (HC1)
coeftest(model, vcov = vcovHC(model, type = "HC1"))

# Using fixest (cleaner syntax)
library(fixest)
model <- feols(outcome ~ treatment + controls, data = df, vcov = "hetero")
summary(model)
Python Output
                            OLS Regression Results (Robust)
==============================================================================
Dep. Variable:                outcome   R-squared:                       0.342
Model:                            OLS   Adj. R-squared:                  0.339
Method:                 Least Squares   F-statistic:                     127.4
Date:                Mon, 27 Jan 2026   Prob (F-statistic):           1.23e-47
No. Observations:                 500
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.4531      0.312      7.861      0.000       1.839       3.067
treatment      1.8724      0.187     10.014      0.000       1.505       2.240
controls       0.4215      0.098      4.301      0.000       0.229       0.614
==============================================================================
Notes: Standard errors are heteroskedasticity robust (HC1)
Stata Output
Linear regression                               Number of obs     =        500
                                                F(2, 497)         =     127.38
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3418
                                                Root MSE          =     2.1847

------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   1.872431   .1869847    10.01   0.000     1.505087    2.239775
    controls |   .4215328   .0980127     4.30   0.000     .2289842    .6140814
       _cons |   2.453124   .3120458     7.86   0.000     1.839434    3.066814
------------------------------------------------------------------------------
R Output
t test of coefficients (Heteroskedasticity-consistent):

             Estimate Std. Error t value  Pr(>|t|)
(Intercept)  2.453124   0.312046  7.8612 2.35e-14 ***
treatment    1.872431   0.186985 10.0138 < 2.2e-16 ***
controls     0.421533   0.098013  4.3008 2.01e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Clustered Standard Errors

When observations are correlated within groups, standard errors that ignore this correlation will be too small—often dramatically so. Common examples include:

  • Students within schools: Students in the same school share teachers, facilities, and peer effects
  • Workers within firms: Workers at the same company share management, culture, and industry shocks
  • Repeated observations: The same individual observed multiple times over a panel
  • Geographic clustering: Individuals in the same state/region share local policies and economic conditions
What to look for in the code

The clustering variable is specified via cov_kwds={'groups': df['cluster_var']} (Python), cluster(cluster_var) (Stata), or cluster = ~cluster_var (R/fixest). The output will report the number of clusters—a rule of thumb is that you need at least 30-50 clusters for clustered standard errors to be reliable. With few clusters, consider wild cluster bootstrap (shown in Section 7.6).
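
The size of the problem is easy to demonstrate by simulation. In this hypothetical setup (illustrative names, not this module's dataset), treatment varies only at the cluster level and units within a cluster share a common shock, so naive standard errors badly understate the uncertainty:

```python
# Sketch: ignoring within-cluster correlation understates standard errors
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_clusters, per_cluster = 40, 25
g = np.repeat(np.arange(n_clusters), per_cluster)
x = np.repeat(rng.normal(size=n_clusters), per_cluster)      # varies only across clusters
shock = np.repeat(rng.normal(size=n_clusters), per_cluster)  # shared within each cluster
y = 1 + 0.5 * x + shock + rng.normal(size=n_clusters * per_cluster)
d = pd.DataFrame({'y': y, 'x': x, 'g': g})

naive = smf.ols('y ~ x', data=d).fit()
clustered = smf.ols('y ~ x', data=d).fit(
    cov_type='cluster', cov_kwds={'groups': d['g']}
)
print(f"naive SE:     {naive.bse['x']:.4f}")
print(f"clustered SE: {clustered.bse['x']:.4f}")  # much larger here
```

As with robust SEs, the coefficient itself is unchanged; only the standard error (and hence the t-statistic and confidence interval) differs.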

# Python: Clustered Standard Errors
import statsmodels.formula.api as smf

# Cluster at the school level
model = smf.ols('outcome ~ treatment', data=df).fit(
    cov_type='cluster',
    cov_kwds={'groups': df['school_id']}
)

# Two-way clustering (e.g., state and year)
from linearmodels.panel import PanelOLS
df_panel = df.set_index(['state', 'year'])
model = PanelOLS.from_formula('outcome ~ treatment', data=df_panel)
results = model.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)
* Stata: Clustered Standard Errors

* One-way clustering
reg outcome treatment controls, cluster(school_id)

* Two-way clustering
* ssc install reghdfe
reghdfe outcome treatment, absorb(state year) cluster(state year)

* Check number of clusters
quietly tab school_id
display r(r)
# R: Clustered Standard Errors
library(fixest)

# One-way clustering
model <- feols(outcome ~ treatment + controls,
               data = df,
               cluster = ~ school_id)
summary(model)

# Two-way clustering
model <- feols(outcome ~ treatment | state + year,
               data = df,
               cluster = ~ state + year)

# With sandwich package
library(sandwich)
model <- lm(outcome ~ treatment, data = df)
vcovCL(model, cluster = ~ school_id)
Python Output
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                outcome   R-squared:                        0.2847
Estimator:                   PanelOLS   R-squared (Between):              0.3124
No. Observations:                2500   R-squared (Within):               0.2847
Date:                Mon, 27 Jan 2026   R-squared (Overall):              0.2956
Cov. Estimator:             Clustered
                                        F-statistic:                      98.237
Entities:                          50   P-value                           0.0000
Time periods:                      50   Distribution:                  F(1,2448)
Cluster Var:           Entity & Time

                             Parameter Estimates
================================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
treatment      2.1543     0.2174     9.9098     0.0000      1.7276      2.5810
================================================================================
Number of entity clusters: 50
Number of time clusters: 50
Stata Output
Linear regression                               Number of obs     =      2,500
                                                F(2, 49)          =      65.82
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2847
                                                Root MSE          =     1.9234

                               (Std. Err. adjusted for 50 clusters in school_id)
------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   2.154312   .2173841     9.91   0.000     1.717406    2.591218
    controls |   .3821456   .1124587     3.40   0.001     .1561547    .6081365
       _cons |   1.287645   .4521378     2.85   0.006     .3793121    2.195978
------------------------------------------------------------------------------

Number of clusters (school_id) = 50
R Output
OLS estimation, Dep. Var.: outcome
Observations: 2,500
Standard-errors: Clustered (school_id)
             Estimate Std. Error  t value  Pr(>|t|)
(Intercept)  1.287645   0.452138  2.84791  0.006342 **
treatment    2.154312   0.217384  9.90978 1.23e-13 ***
controls     0.382146   0.112459  3.39830  0.001378 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.92344   Adj. R2: 0.28364
                 Std-Errors: Clustered (school_id), 50 clusters
When to Cluster

General rule: cluster at the level where:

  • Treatment is assigned (for experiments)
  • There's correlation in unobservables (for observational studies)

When in doubt, cluster at a more aggregate level. See Cameron & Miller (2015) for guidance.

Comparing Models: estimates store + esttab

A common workflow is to estimate several specifications (e.g., classical SEs vs. robust SEs, or progressively adding controls) and present them side-by-side. In Stata, the pattern is: run a regression, store its results with estimates store, repeat for each model, then display them together with esttab. In Python, the Stargazer package serves a similar role; in R, modelsummary or etable from fixest.

Compare like with like

When comparing models side-by-side, always use the same type of standard errors across all columns. Switching from classical SEs in one model to robust SEs in another makes it impossible to tell whether a change in significance is due to the added controls or the different SE type.

# Python: Compare models with Stargazer
from stargazer.stargazer import Stargazer
import statsmodels.formula.api as smf

# Model 1: baseline
m1 = smf.ols('wage ~ education', data=df).fit(cov_type='HC1')

# Model 2: add controls
m2 = smf.ols('wage ~ education + experience', data=df).fit(cov_type='HC1')

# Model 3: full specification
m3 = smf.ols('wage ~ education + experience + age', data=df).fit(cov_type='HC1')

# Side-by-side comparison
table = Stargazer([m1, m2, m3])
print(table.render_text())
* Stata: Compare models with esttab
* ssc install estout, replace

* Model 1: baseline
regress wage education, robust
estimates store baseline

* Model 2: add controls
regress wage education experience, robust
estimates store controls

* Model 3: full specification
regress wage education experience age, robust
estimates store full

* Side-by-side comparison table
esttab baseline controls full, se star(* 0.10 ** 0.05 *** 0.01)

* Export to LaTeX
esttab baseline controls full using "output/tables/results.tex", ///
    se replace star(* 0.10 ** 0.05 *** 0.01)
# R: Compare models with fixest::etable or modelsummary
library(fixest)

# Estimate three specifications
m1 <- feols(wage ~ education, data = df, vcov = "hetero")
m2 <- feols(wage ~ education + experience, data = df, vcov = "hetero")
m3 <- feols(wage ~ education + experience + age, data = df, vcov = "hetero")

# Side-by-side comparison
etable(m1, m2, m3, se.below = TRUE)

# Export to LaTeX
etable(m1, m2, m3, tex = TRUE, file = "output/tables/results.tex")
Python Output
=====================================================================
                          Dependent variable: wage
                    (1)            (2)            (3)
---------------------------------------------------------------------
education        1.247***       1.182***       1.174***
                 (0.089)        (0.082)        (0.083)

experience                      0.534***       0.487***
                                (0.067)        (0.078)

age                                            0.052
                                               (0.041)

Intercept        8.234***       5.127***       4.823***
                 (1.124)        (1.087)        (1.142)
---------------------------------------------------------------------
Observations       500            500            500
R-squared         0.284          0.371          0.373
Adj. R-squared    0.283          0.369          0.369
=====================================================================
Note:                         *p<0.1; **p<0.05; ***p<0.01
                              Robust standard errors in parentheses
Stata Output
------------------------------------------------------------
                      (1)          (2)          (3)
                     wage         wage         wage
------------------------------------------------------------
education           1.247***     1.182***     1.174***
                  (0.089)      (0.082)      (0.083)

experience                       0.534***     0.487***
                               (0.067)      (0.078)

age                                           0.052
                                            (0.041)

_cons               8.234***     5.127***     4.823***
                  (1.124)      (1.087)      (1.142)
------------------------------------------------------------
N                     500          500          500
R-sq                0.284        0.371        0.373
------------------------------------------------------------
Standard errors in parentheses
* p<0.10, ** p<0.05, *** p<0.01
R Output
                             m1           m2           m3
Dependent Var.:            wage         wage         wage

education           1.247***     1.182***     1.174***
                   (0.0893)     (0.0821)     (0.0834)
experience                       0.534***     0.487***
                                (0.0672)     (0.0781)
age                                           0.0524
                                             (0.0412)
Constant            8.234***     5.127***     4.823***
                   (1.124)      (1.087)      (1.142)
___________________________ ____________ ____________ ____________
S.E. type                 Heteroskedasticity-robust
Observations               500          500          500
R2                       0.28437      0.37082      0.37257
Adj. R2                  0.28293      0.36829      0.36878

Regression Diagnostics

After estimation, diagnostic tests help verify model assumptions and identify potential problems. Three key diagnostics are: variance inflation factors (detect multicollinearity), heteroskedasticity tests (validate SE choice), and residual plots (check model specification).

# Python: Regression diagnostics
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan
import matplotlib.pyplot as plt

model = smf.ols('wage ~ education + experience + age', data=df).fit()

# 1. Variance Inflation Factors (VIF) — detect multicollinearity
X = model.model.exog
for i, name in enumerate(model.model.exog_names):
    if name == 'Intercept':
        continue  # VIF is not meaningful for the constant term
    print(f"{name}: VIF = {variance_inflation_factor(X, i):.2f}")

# 2. Breusch-Pagan test for heteroskedasticity
bp_stat, bp_pval, _, _ = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {bp_pval:.4f}")

# 3. Residual vs. Fitted plot — check specification
plt.scatter(model.fittedvalues, model.resid, alpha=0.3)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual vs. Fitted Plot')
plt.savefig('diagnostics.pdf', bbox_inches='tight')
* Stata: Regression diagnostics

reg wage education experience age, robust

* 1. Variance Inflation Factors
estat vif

* 2. Breusch-Pagan heteroskedasticity test

* Must re-estimate without robust for this test
quietly reg wage education experience age
estat hettest

* 3. Residual vs. Fitted plot
rvfplot, yline(0)
# R: Regression diagnostics
library(car)  # for vif()

model <- lm(wage ~ education + experience + age, data = df)

# 1. Variance Inflation Factors
vif(model)

# 2. Breusch-Pagan test
library(lmtest)
bptest(model)

# 3. Residual vs. Fitted plot
plot(model, which = 1)
Python Output
education: VIF = 2.84
experience: VIF = 7.12
age: VIF = 6.98

Breusch-Pagan p-value: 0.0251

[Residual vs. Fitted plot saved to diagnostics.pdf]
Stata Output
. estat vif

    Variable |       VIF       1/VIF
-------------+----------------------
  experience |      7.12    0.140449
         age |      6.98    0.143266
   education |      2.84    0.352113
-------------+----------------------
    Mean VIF |      5.65

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of wage

         chi2(1)      =     9.34
         Prob > chi2  =   0.0022
R Output
> vif(model)
 education experience        age
      2.84       7.12       6.98

> bptest(model)

	studentized Breusch-Pagan test

data:  model
BP = 9.3412, df = 3, p-value = 0.02508

[Residual vs. Fitted plot displayed]
Diagnostic Checklist

  • VIF > 10: multicollinearity; coefficients are unstable. Drop or combine correlated variables.
  • Breusch-Pagan p < 0.05: heteroskedasticity detected; use robust SEs (cov_type='HC1').
  • Curved residual plot: misspecification; try adding polynomial or log terms.
  • Funnel-shaped residuals: heteroskedasticity; confirms the need for robust SEs.
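
For reference, the VIF for a regressor is 1/(1 − R²) from the auxiliary regression of that regressor on all the others, which you can verify by hand (simulated data with illustrative names):

```python
# Sketch: VIF = 1 / (1 - R^2) from the auxiliary regression
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 300
experience = rng.normal(size=n)
age = experience + 0.4 * rng.normal(size=n)  # strongly collinear with experience
education = rng.normal(size=n)
X = pd.DataFrame({'const': 1.0, 'education': education,
                  'experience': experience, 'age': age})

# Built-in VIF for 'age' (column index 3)
vif_builtin = variance_inflation_factor(X.values, 3)

# Same number by hand: R^2 from regressing age on the other regressors
r2_aux = smf.ols('age ~ education + experience', data=X).fit().rsquared
print(f"built-in: {vif_builtin:.2f}, by hand: {1 / (1 - r2_aux):.2f}")
```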

7.3 Panel Data Methods

Panel data (repeated observations of units over time) is the workhorse of applied microeconomics because it enables researchers to control for unobserved heterogeneity. The key intuition: if individuals differ in unobserved ways that affect both treatment and outcomes, cross-sectional comparisons are biased. Panel data lets us compare the same individual over time, differencing out time-invariant unobservables.

Fixed Effects

Fixed effects estimation includes a separate intercept for each unit (person, firm, country), absorbing all time-invariant characteristics. Mechanically, this is equivalent to "demeaning"—subtracting each unit's mean from all its observations. The identifying variation comes only from within-unit changes over time.

What to look for in the code

The key syntax elements are: entity_effects=True (Python linearmodels), fe or absorb(unit_id) (Stata), and | unit_id after the formula (R fixest). Two-way fixed effects add time fixed effects as well, controlling for common shocks (e.g., recessions) that affect all units.

Important: You can only estimate coefficients on variables that vary within units over time. Time-invariant characteristics (gender, race, birth cohort) are absorbed by the fixed effects and cannot be estimated—but they are controlled for.
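
Why time-invariant variables drop out is visible in the demeaning step itself: a column that is constant within each unit demeans to exactly zero (toy data; 'female' is an illustrative variable):

```python
# Sketch: a time-invariant column has no within-unit variation after demeaning
import pandas as pd

d = pd.DataFrame({
    'unit':   [1, 1, 1, 2, 2, 2],
    'female': [1, 1, 1, 0, 0, 0],   # constant within each unit
    'wage':   [3.0, 5.0, 4.0, 1.0, 2.0, 3.0],
})
dm = d.groupby('unit')[['female', 'wage']].transform(lambda s: s - s.mean())
print(dm)
# 'female' demeans to all zeros, so its coefficient cannot be estimated under unit FE
```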

# Python: Fixed Effects with linearmodels
from linearmodels.panel import PanelOLS

# Set panel index
df = df.set_index(['unit_id', 'year'])

# Unit fixed effects only
model = PanelOLS.from_formula(
    'outcome ~ treatment + time_varying_controls',
    data=df,
    entity_effects=True
)
results = model.fit(cov_type='clustered', cluster_entity=True)

# Two-way fixed effects (unit + time)
model = PanelOLS.from_formula(
    'outcome ~ treatment',
    data=df,
    entity_effects=True,
    time_effects=True
)
results = model.fit(cov_type='clustered', cluster_entity=True)
* Stata: Fixed Effects

* Declare panel structure
xtset unit_id year

* Unit fixed effects
xtreg outcome treatment time_varying_controls, fe cluster(unit_id)

* Two-way fixed effects with reghdfe (RECOMMENDED for speed)
reghdfe outcome treatment, absorb(unit_id year) cluster(unit_id)

* With multiple sets of fixed effects
reghdfe outcome treatment, absorb(unit_id year industry_id) cluster(unit_id)
# R: Fixed Effects with fixest
library(fixest)

# Unit fixed effects
model <- feols(outcome ~ treatment + controls | unit_id,
               data = df,
               cluster = ~ unit_id)
summary(model)

# Two-way fixed effects
model <- feols(outcome ~ treatment | unit_id + year,
               data = df,
               cluster = ~ unit_id)

# Multiple sets of fixed effects
model <- feols(outcome ~ treatment | unit_id + year + industry_id,
               data = df,
               cluster = ~ unit_id)

# With plm package (older, slower)
library(plm)
pdata <- pdata.frame(df, index = c("unit_id", "year"))
model <- plm(outcome ~ treatment, data = pdata, model = "within")
Python Output
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                outcome   R-squared:                        0.4218
Estimator:                   PanelOLS   R-squared (Between):              0.0147
No. Observations:                5000   R-squared (Within):               0.4218
Date:                Mon, 27 Jan 2026   R-squared (Overall):              0.3156
Time:                        14:32:17   Log-likelihood                   -8547.2
Cov. Estimator:             Clustered
                                        F-statistic:                      182.45
Entities:                         500   P-value                           0.0000
Time periods:                      10   Distribution:                  F(2,4497)

                             Parameter Estimates
================================================================================
                   Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------------
treatment             1.4872     0.1102    13.4954     0.0000      1.2712      1.7032
time_varying_ctrl     0.8234     0.0876     9.3995     0.0000      0.6516      0.9952
================================================================================

F-test for Poolability: 4.2341
P-value: 0.0000
Distribution: F(499,4497)

Included effects: Entity, Time
Stata Output
. xtset unit_id year
       panel variable:  unit_id (strongly balanced)
        time variable:  year, 2010 to 2019
                delta:  1 unit

. xtreg outcome treatment time_varying_controls, fe cluster(unit_id)

Fixed-effects (within) regression               Number of obs     =      5,000
Group variable: unit_id                         Number of groups  =        500

R-sq:                                           Obs per group:
     within  = 0.4218                                         min =         10
     between = 0.0147                                         avg =       10.0
     overall = 0.3156                                         max =         10

                                                F(2,499)          =     182.45
corr(u_i, Xb)  = 0.0234                         Prob > F          =     0.0000

                                  (Std. Err. adjusted for 500 clusters in unit_id)
----------------------------------------------------------------------------------
                 |               Robust
         outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
       treatment |   1.487218   .1102341    13.50   0.000     1.270598    1.703838
time_varying_ctr |   .8234127   .0876234     9.40   0.000     .6512847    .9955407
           _cons |   2.134567   .2341876     9.11   0.000     1.674451    2.594683
-----------------+----------------------------------------------------------------
         sigma_u |  1.8723456
         sigma_e |  1.2345678
             rho |  .69712345   (fraction of variance due to u_i)
----------------------------------------------------------------------------------
R Output
OLS estimation, Dep. Var.: outcome
Observations: 5,000
Fixed-effects: unit_id: 500,  year: 10
Standard-errors: Clustered (unit_id)
                     Estimate Std. Error   t value   Pr(>|t|)
treatment            1.487218   0.110234  13.49536  < 2.2e-16 ***
time_varying_ctrl    0.823413   0.087623   9.39950  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.23457     Adj. R2: 0.42018
Fixef resid. var.: 1.52416  (500 + 10 FEs)

Random Effects

Random effects (RE) treats unit-specific effects as random draws from a distribution rather than fixed parameters. This is more efficient than fixed effects (smaller standard errors) but requires a stronger assumption: the unit effects must be uncorrelated with the regressors. If this assumption fails, RE estimates are inconsistent.

When to use Random Effects

In practice, RE is rarely preferred in economics research because the key assumption (unit effects uncorrelated with regressors) is hard to justify. The Hausman test compares FE and RE: if they differ significantly, RE is inconsistent and FE should be used. RE remains useful when you need to estimate coefficients on time-invariant variables, or when sample sizes are small and efficiency gains matter.

# Python: Random Effects
from linearmodels.panel import RandomEffects

df = df.set_index(['unit_id', 'year'])

model = RandomEffects.from_formula(
    'outcome ~ treatment + controls',
    data=df
)
results = model.fit()

# Compare FE and RE coefficients side-by-side
from linearmodels.panel import PanelOLS, compare
fe = PanelOLS.from_formula('outcome ~ treatment', data=df, entity_effects=True).fit()
re = RandomEffects.from_formula('outcome ~ treatment', data=df).fit()
print(compare({'FE': fe, 'RE': re}))

# Hausman test (linearmodels has no built-in; compute the statistic by hand)
import numpy as np
from scipy import stats
diff = fe.params - re.params
stat = float(diff @ np.linalg.inv(fe.cov - re.cov) @ diff)
print(f"Hausman chi2({len(diff)}) = {stat:.2f}, p = {stats.chi2.sf(stat, len(diff)):.4f}")
* Stata: Random Effects

* Random effects model
xtreg outcome treatment controls, re

* Hausman test
xtreg outcome treatment controls, fe
estimates store fe
xtreg outcome treatment controls, re
hausman fe .
# R: Random Effects
library(plm)

pdata <- pdata.frame(df, index = c("unit_id", "year"))

# Random effects model
re_model <- plm(outcome ~ treatment + controls,
                data = pdata,
                model = "random")
summary(re_model)

# Hausman test
fe_model <- plm(outcome ~ treatment + controls, data = pdata, model = "within")
phtest(fe_model, re_model)
Python Output
                        RandomEffects Estimation Summary
================================================================================
Dep. Variable:                outcome   R-squared:                        0.3524
Estimator:              RandomEffects   R-squared (Between):              0.4012
No. Observations:                5000   R-squared (Within):               0.3187
Date:                Mon, 27 Jan 2026   R-squared (Overall):              0.3524

                             Parameter Estimates
================================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
Intercept      1.8923     0.1876    10.0865     0.0000      1.5246      2.2600
treatment      1.5234     0.0987    15.4346     0.0000      1.3299      1.7169
controls       0.7821     0.0654    11.9587     0.0000      0.6539      0.9103
================================================================================

                        Model Comparison
================================================================================
                                     FE              RE
--------------------------------------------------------------------------------
Dep. Variable                   outcome         outcome
Estimator                      PanelOLS    RandomEffects
No. Observations                   5000            5000
Cov. Estimator                  Unadjusted      Unadjusted
R-squared                         0.3187          0.3524

                                     FE              RE
--------------------------------------------------------------------------------
treatment                        1.4872          1.5234
                               (0.1102)        (0.0987)
================================================================================

Hausman test statistic: 8.234
P-value: 0.0041
Conclusion: Reject H0 at 5% level - use Fixed Effects
Stata Output
. xtreg outcome treatment controls, re

Random-effects GLS regression                   Number of obs     =      5,000
Group variable: unit_id                         Number of groups  =        500

R-sq:                                           Obs per group:
     within  = 0.3187                                         min =         10
     between = 0.4012                                         avg =       10.0
     overall = 0.3524                                         max =         10

                                                Wald chi2(2)      =     476.23
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   1.523412   .0987234    15.43   0.000     1.329919    1.716905
    controls |   .7821345   .0653876    11.96   0.000     .6539769    .9102921
       _cons |   1.892345   .1876234    10.09   0.000     1.524612    2.260078
-------------+----------------------------------------------------------------
     sigma_u |  1.4523456
     sigma_e |  1.2345678
         rho |  .58012345   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. hausman fe .

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |       fe           .          Difference          S.E.
-------------+----------------------------------------------------------------
   treatment |    1.487218     1.523412       -.0361943        .0412187
    controls |    .8234127     .7821345        .0412782        .0287654
------------------------------------------------------------------------------
                           b = consistent under Ho and Ha
            B = inconsistent under Ha, efficient under Ho

    Test:  Ho:  difference in coefficients not systematic

                  chi2(2) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =        8.23
                Prob>chi2 =      0.0041
                (V_b-V_B is not positive definite)
R Output
Oneway (individual) effect Random Effect Model
   (Swamy-Arora's transformation)

Call:
plm(formula = outcome ~ treatment + controls, data = pdata, model = "random")

Balanced Panel: n = 500, T = 10, N = 5000

Effects:
                  var std.dev share
idiosyncratic 1.5235   1.2343  0.42
individual    2.1098   1.4525  0.58
theta: 0.7821

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max.
-4.23456  -0.82345   0.01234   0.84567   4.56789

Coefficients:
             Estimate Std. Error z-value  Pr(>|z|)
(Intercept)  1.892345   0.187623  10.086 < 2.2e-16 ***
treatment    1.523412   0.098723  15.435 < 2.2e-16 ***
controls     0.782135   0.065388  11.959 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    12345.6
Residual Sum of Squares: 7987.23
R-Squared:      0.35243
Adj. R-Squared: 0.35217
Chisq: 476.23 on 2 DF, p-value: < 2.22e-16

	Hausman Test

data:  outcome ~ treatment + controls
chisq = 8.234, df = 2, p-value = 0.004123
alternative hypothesis: one model is inconsistent

First-Difference Estimator

First-differencing is an alternative to the within (demeaning) transformation: it eliminates unit fixed effects by subtracting each unit's previous-period observation. It is numerically identical to FE when T=2 but differs for longer panels. As a rule of thumb, FE is more efficient when the idiosyncratic errors are serially uncorrelated, while FD is preferable when the errors are highly persistent (close to a random walk), since differencing then makes them approximately white noise.
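When T=2 the equivalence is exact and easy to verify numerically. A minimal sketch with simulated data (hypothetical variable names, plain NumPy rather than a panel library):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200  # units, each observed in exactly T = 2 periods

alpha = rng.normal(size=N)                 # unit fixed effects
x = rng.normal(size=(N, 2))                # regressor in periods 1 and 2
y = alpha[:, None] + 1.5 * x + rng.normal(size=(N, 2))

# Within (FE) estimator: demean y and x within each unit
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_fe = (xd * yd).sum() / (xd ** 2).sum()

# First-difference (FD) estimator: regress delta-y on delta-x
dx = x[:, 1] - x[:, 0]
dy = y[:, 1] - y[:, 0]
b_fd = (dx * dy).sum() / (dx ** 2).sum()

print(b_fe, b_fd)  # identical up to floating-point error when T = 2
```

With T=2 the within-demeaned values are just plus or minus half the first difference, so the two estimators coincide term by term.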

# Python: First-Difference Estimator
from linearmodels.panel import FirstDifferenceOLS
import pandas as pd

df = df.set_index(['unit_id', 'year'])

# First-difference estimation
model_fd = FirstDifferenceOLS.from_formula(
    'outcome ~ treatment + time_varying_controls',
    data=df
)
results_fd = model_fd.fit(cov_type='clustered', cluster_entity=True)
print(results_fd.summary)
* Stata: First-Difference Estimator

xtset unit_id year

* First-difference: D.y on D.x
reg D.outcome D.treatment D.controls, cluster(unit_id)

* Note: xtreg has no first-difference option; regressing
* D.outcome on D.treatment and D.controls, as above, is the
* standard Stata implementation of the FD estimator
# R: First-Difference Estimator
library(plm)

# First-difference estimation
model_fd <- plm(outcome ~ treatment + controls,
                data = df,
                index = c("unit_id", "year"),
                model = "fd")
summary(model_fd)

# Compare FE vs FD
model_fe <- plm(outcome ~ treatment + controls,
                data = df,
                index = c("unit_id", "year"),
                model = "within")
cat("FE estimate:", coef(model_fe)["treatment"], "\n")
cat("FD estimate:", coef(model_fd)["treatment"], "\n")
Python Output
                    FirstDifferenceOLS Estimation Summary
================================================================================
Dep. Variable:                outcome   R-squared:                        0.2134
Estimator:        FirstDifferenceOLS   R-squared (Between):              0.0082
No. Observations:                4500   R-squared (Within):               0.2134
Date:                Mon, 27 Jan 2026   R-squared (Overall):              0.1876
Cov. Estimator:             Clustered
                                        F-statistic:                      87.234
Entities:                         500   P-value                           0.0000
Time periods:                       9   Distribution:                  F(2,3997)

                             Parameter Estimates
========================================================================================
                       Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
----------------------------------------------------------------------------------------
D.treatment               1.4123     0.1523     9.2731     0.0000      1.1137      1.7109
D.time_varying_ctrl       0.7845     0.1012     7.7521     0.0000      0.5861      0.9829
========================================================================================

Number of entity clusters: 500
Stata Output
. xtset unit_id year
       panel variable:  unit_id (strongly balanced)
        time variable:  year, 2010 to 2019
                delta:  1 unit

. reg D.outcome D.treatment D.controls, cluster(unit_id)

Linear regression                               Number of obs     =      4,500
                                                F(2, 499)         =      87.23
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2134
                                                Root MSE          =     1.5423

                               (Std. Err. adjusted for 500 clusters in unit_id)
------------------------------------------------------------------------------
             |               Robust
   D.outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 D.treatment |   1.412312   .1523456     9.27   0.000     1.113048    1.711576
  D.controls |   .7845123   .1012345     7.75   0.000     .5857821    .9832425
       _cons |   .0234567   .0312456     0.75   0.453    -.0379125    .0848259
------------------------------------------------------------------------------
R Output
Oneway (individual) effect First-Difference Model

Call:
plm(formula = outcome ~ treatment + controls, data = df,
    model = "fd", index = c("unit_id", "year"))

Balanced Panel: n = 500, T = 10, N = 5000
Observations used in estimation: 4500

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max.
-4.12345  -0.78234   0.00987   0.79876   4.34567

Coefficients:
             Estimate Std. Error t-value  Pr(>|t|)
(Intercept)  0.023457   0.031246  0.7507    0.4530
treatment    1.412312   0.152346  9.2731 < 2.2e-16 ***
controls     0.784512   0.101235  7.7521 1.23e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    13456.7
Residual Sum of Squares: 10587.2
R-Squared:      0.21337
Adj. R-Squared: 0.21302

FE estimate: 1.4872
FD estimate: 1.4123

Hausman Test: FE vs. RE

The Hausman test compares Fixed Effects and Random Effects estimates. Under the null hypothesis, both are consistent but RE is efficient; under the alternative, RE is inconsistent (the unit effects are correlated with regressors) and only FE is consistent. A significant test favors FE.

# Python: Hausman test
from linearmodels.panel import PanelOLS, RandomEffects
from linearmodels.panel import compare

df = df.set_index(['unit_id', 'year'])
# FE model (entity effects enter via the formula in linearmodels)
fe = PanelOLS.from_formula('outcome ~ treatment + controls + EntityEffects',
                           data=df).fit()
# RE model
re = RandomEffects.from_formula('outcome ~ treatment + controls',
                                data=df).fit()

# Compare: Hausman-like test
print(compare({'FE': fe, 'RE': re}))
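linearmodels does not ship a Hausman test, so compare() only juxtaposes the two fits. The statistic itself, chi2 = (b_FE − b_RE)' [V_FE − V_RE]^(−1) (b_FE − b_RE), is straightforward to compute by hand. The sketch below uses illustrative made-up coefficient vectors and diagonal covariance matrices; in practice you would pass fe.params, fe.cov, re.params, re.cov, aligned on the common slopes and excluding the intercept:

```python
import numpy as np
from scipy import stats

def hausman(b_fe, b_re, v_fe, v_re):
    """Hausman statistic: (b_FE - b_RE)' [V_FE - V_RE]^{-1} (b_FE - b_RE)."""
    diff = b_fe - b_re
    v_diff = v_fe - v_re                 # may fail to be positive definite
    stat = float(diff @ np.linalg.inv(v_diff) @ diff)
    dof = len(diff)
    return stat, stats.chi2.sf(stat, dof)

# Illustrative (made-up) inputs
b_fe = np.array([1.4872, 0.8234])
b_re = np.array([1.5234, 0.7821])
v_fe = np.diag([0.1102, 0.0876]) ** 2    # squared standard errors
v_re = np.diag([0.0987, 0.0654]) ** 2

stat, p = hausman(b_fe, b_re, v_fe, v_re)
print(f"chi2({len(b_fe)}) = {stat:.3f}, p = {p:.4f}")
```

The diagonal covariances here are a simplification; with real fits, use the full covariance matrices, since the off-diagonal terms change the statistic.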
* Stata: Hausman test

* Estimate FE and store
xtreg outcome treatment controls, fe
estimates store fe_model

* Estimate RE and store
xtreg outcome treatment controls, re
estimates store re_model

* Hausman test
hausman fe_model re_model

* Robust Hausman test (with clustered SEs)
* requires user-written command: ssc install xtoverid (run after xtreg, re)
xtoverid
# R: Hausman test
library(plm)

# Estimate both models
model_fe <- plm(outcome ~ treatment + controls,
                data = df, index = c("unit_id", "year"),
                model = "within")
model_re <- plm(outcome ~ treatment + controls,
                data = df, index = c("unit_id", "year"),
                model = "random")

# Hausman test
hausman_test <- phtest(model_fe, model_re)
print(hausman_test)

# Decision rule:
# p < 0.05 → reject H0: use Fixed Effects (unit effects correlated with X)
# p > 0.05 → fail to reject: Random Effects is consistent and more efficient
Python Output
                        Model Comparison
================================================================================
                                     FE              RE
--------------------------------------------------------------------------------
Dep. Variable                   outcome         outcome
Estimator                      PanelOLS    RandomEffects
No. Observations                   5000            5000
Cov. Estimator                Unadjusted      Unadjusted
R-squared                       0.4218          0.3524
R-squared (Between)             0.0147          0.4012
R-squared (Within)              0.4218          0.3187
R-squared (Overall)             0.3156          0.3524

                                     FE              RE
--------------------------------------------------------------------------------
treatment                        1.4872          1.5234
                               (0.1102)        (0.0987)
controls                         0.8234          0.7821
                               (0.0876)        (0.0654)
================================================================================
Hausman statistic: 8.234
P-value: 0.0041

Note: compare() reports only the table above; the Hausman statistic
must be computed manually from the stored FE and RE results.
Stata Output
. xtreg outcome treatment controls, fe
  (output omitted)
. estimates store fe_model

. xtreg outcome treatment controls, re
  (output omitted)
. estimates store re_model

. hausman fe_model re_model

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |    fe_model    re_model      Difference          S.E.
-------------+----------------------------------------------------------------
   treatment |    1.487218     1.523412       -.036194        .041219
    controls |    .8234127     .7821345        .041278        .028765
------------------------------------------------------------------------------
                           b = consistent under Ho and Ha
            B = inconsistent under Ha, efficient under Ho

    Test:  Ho:  difference in coefficients not systematic

                  chi2(2) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =        8.23
                Prob>chi2 =      0.0041

. xtoverid

Test of overidentifying restrictions: fixed vs random effects
Cross-section time-series model: xtreg re
Sargan-Hansen statistic   8.452  Chi-sq(2)   P-value = 0.0036
R Output
	Hausman Test

data:  outcome ~ treatment + controls
chisq = 8.234, df = 2, p-value = 0.004123
alternative hypothesis: one model is inconsistent

# Interpretation:
# p-value = 0.004 < 0.05
# Reject H0: RE is inconsistent
# Conclusion: Use Fixed Effects

Arellano-Bond: Dynamic Panel Data

When the model includes a lagged dependent variable (y_{t-1} on the right-hand side), fixed effects estimation is biased because the demeaned lagged dependent variable is mechanically correlated with the demeaned error (the Nickell bias, which shrinks at rate 1/T). The Arellano-Bond (1991) estimator first-differences the equation and uses deeper lags of y as instruments, providing consistent estimates in dynamic panels.
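The bias is easy to see by simulation. A sketch with synthetic data and plain NumPy: with a true autoregressive coefficient of 0.5 and short T, the within estimator of the lag coefficient comes out well below 0.5:

```python
import numpy as np

rng = np.random.default_rng(42)
N, T, rho = 2000, 5, 0.5

alpha = rng.normal(size=N)              # unit fixed effects
y = np.zeros((N, T + 1))
for t in range(1, T + 1):               # AR(1) panel with unit effects
    y[:, t] = alpha + rho * y[:, t - 1] + rng.normal(size=N)

# Within (FE) estimate of rho: demean y_t and y_{t-1} within units
y_t, y_lag = y[:, 1:], y[:, :-1]
y_t = y_t - y_t.mean(axis=1, keepdims=True)
y_lag = y_lag - y_lag.mean(axis=1, keepdims=True)
rho_fe = (y_t * y_lag).sum() / (y_lag ** 2).sum()

print(f"true rho = {rho}, within estimate = {rho_fe:.3f}")  # biased downward
```

Increasing T shrinks the bias, which is why the problem is most acute in short panels.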

# Python: Arellano-Bond (using linearmodels or ivgmm)
# Note: Python support for Arellano-Bond is limited
# Manual implementation via first-difference + IV
from linearmodels.iv import IV2SLS

df = df.sort_values(['unit_id', 'year'])

# Build the lag, then first-difference the equation
df['lag_outcome'] = df.groupby('unit_id')['outcome'].shift(1)
df['dy'] = df.groupby('unit_id')['outcome'].diff()
df['dlag_y'] = df.groupby('unit_id')['lag_outcome'].diff()
df['dx'] = df.groupby('unit_id')['treatment'].diff()
df['lag2_y'] = df.groupby('unit_id')['outcome'].shift(2)

# 2SLS on differenced equation with lag-2 instruments
model = IV2SLS.from_formula(
    'dy ~ 1 + dx + [dlag_y ~ lag2_y]',
    data=df.dropna()
).fit()
print(model.summary)
* Stata: Arellano-Bond (xtabond)

xtset unit_id year

* Arellano-Bond one-step estimator
xtabond outcome treatment controls, lags(1) vce(robust)

* System GMM (Blundell-Bond): more efficient
xtdpdsys outcome treatment controls, lags(1) vce(robust)

* Arellano-Bond test for serial correlation
* AR(1) should be significant; AR(2) should NOT be
estat abond
# R: Arellano-Bond with plm
library(plm)

# Arellano-Bond one-step GMM
model_ab <- pgmm(
  outcome ~ lag(outcome, 1) + treatment + controls |
    lag(outcome, 2:99),  # use y_{t-2} and deeper as instruments
  data = df,
  index = c("unit_id", "year"),
  effect = "twoways",
  model = "onestep"
)
summary(model_ab)

# Sargan test (overidentification)
# p > 0.05 means instruments are valid

# AR(2) test: should be insignificant
# (no second-order serial correlation in differences)
Python Output
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                     dy   R-squared:                       0.1234
Estimator:                    IV-2SLS   Adj. R-squared:                  0.1198
No. Observations:                 800   F-statistic:                     45.67
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
dx             0.3456      0.087      3.972      0.000       0.175       0.516
dlag_y         0.4123      0.134      3.077      0.002       0.150       0.675
==============================================================================

Endogeneity test (Wu-Hausman):  8.234  p-value: 0.0041
Instrument relevance (F-stat):  34.56

Note: Manual Arellano-Bond via first-difference + IV.
Use deeper lags as instruments for the differenced lagged dependent variable.
Stata Output
. xtset unit_id year
       panel variable:  unit_id (strongly balanced)
        time variable:  year, 2000 to 2010
                delta:  1 unit

. xtabond outcome treatment controls, lags(1) vce(robust)

Arellano-Bond dynamic panel-data estimation     Number of obs     =        800
Group variable: unit_id                         Number of groups  =        100
Time variable: year
                                                Obs per group:
                                                              min =          8
                                                              avg =          8
                                                              max =          8

Number of instruments =     30                  Wald chi2(3)      =     156.78
                                                Prob > chi2       =     0.0000
One-step results
                                   (Std. Err. adjusted for clustering on unit_id)
------------------------------------------------------------------------------
             |             Robust
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     outcome |
         L1. |   .4123456   .0876543     4.70   0.000     .2405453    .5841459
   treatment |   .3456789   .0654321     5.28   0.000     .2174344    .4739234
    controls |   .1234567   .0432109     2.86   0.004     .0387648    .2081486
       _cons |   .5678901   .2345678     2.42   0.016     .1081459    1.027634
------------------------------------------------------------------------------

. estat abond

Arellano-Bond test for zero autocorrelation in first-differenced errors
  +-----------------------+
  |Order |  z     Prob > z|
  |------+----------------|
  |   1  |-3.4567  0.0005 |
  |   2  |-0.8765  0.3809 |
  +-----------------------+
  H0: no autocorrelation

AR(1): z = -3.46, p = 0.001 (expected — reject)
AR(2): z = -0.88, p = 0.381 (good — do not reject)
R Output
Onestep Generalized Method of Moments Estimation

Call:
pgmm(outcome ~ lag(outcome, 1) + treatment + controls | lag(outcome,
    2:99), data = df, index = c("unit_id", "year"), effect = "twoways",
    model = "onestep")

Unbalanced Panel: n = 100, T = 8-10, N = 800

Number of Observations Used: 800
Residuals:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-4.23456  -0.87654  -0.01234   0.00123   0.89012   3.98765

Coefficients:
                Estimate Std. Error z-value  Pr(>|z|)
lag(outcome, 1)  0.41235    0.09123  4.5199 6.19e-06 ***
treatment        0.34568    0.06789  5.0915 3.56e-07 ***
controls         0.12346    0.04567  2.7037  0.00686 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sargan test: chisq(26) = 28.456 (p-value = 0.3367)
Autocorrelation test (1): normal = -3.4567 (p-value = 0.00055)
Autocorrelation test (2): normal = -0.8765 (p-value = 0.38091)
Wald test for coefficients: chisq(3) = 156.78 (p-value = < 2.22e-16)
Dynamic Panel Pitfalls

Arellano-Bond works best with "large N, small T" panels (many units, few time periods). In long panels the instrument count grows quadratically in T, which can overfit the endogenous variables and weaken identification; common remedies are to cap the lag depth or collapse the instrument matrix. Always report the Sargan/Hansen test for overidentification (should be insignificant) and the AR(2) test for serial correlation in the differenced errors (should also be insignificant to validate the moment conditions).
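To make the proliferation concrete: in the basic Arellano-Bond setup, the differenced equation at period t (t = 3, ..., T) can use y_{t-2}, ..., y_1 as instruments, so the total is 1 + 2 + ... + (T-2) = (T-1)(T-2)/2, quadratic in T. A quick sketch:

```python
def ab_instrument_count(T: int) -> int:
    """Lag instruments for the differenced equation: sum over t = 3..T of (t - 2)."""
    return sum(t - 2 for t in range(3, T + 1))

for T in (5, 10, 20, 40):
    print(T, ab_instrument_count(T))  # 6, 36, 171, 741
```

This is why xtabond2-style options that restrict or collapse the instrument set matter as T grows.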

7.4 Nonlinear Models

Linear models assume the conditional mean of Y given X is a linear function of parameters. This breaks down when:

  • Binary outcomes: Linear probability models can predict probabilities outside [0,1]
  • Count data: Counts are non-negative integers, but OLS can predict negative values
  • Bounded/skewed outcomes: Expenditure data, durations, and other inherently positive variables

Nonlinear models address these issues by modeling a transformed version of the outcome or using appropriate distributional assumptions.
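A quick illustration of the first failure mode, with simulated data (hypothetical names): OLS on a binary outcome generated from a logistic model produces fitted "probabilities" outside [0, 1] for a nontrivial share of observations:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))   # true logistic probabilities
y = rng.binomial(1, p_true)                   # binary outcome

# Linear probability model: OLS of the 0/1 outcome on x
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta

share_out = ((fitted < 0) | (fitted > 1)).mean()
print(f"share of fitted values outside [0, 1]: {share_out:.1%}")
```

The LPM is still useful for average effects, but its predictions cannot be read as probabilities in the tails.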

Logit and Probit

For binary outcomes (0/1), logit and probit models estimate the probability P(Y=1|X). They differ in the link function: logit uses the logistic CDF, probit uses the normal CDF. In practice, they give very similar results—the choice is often a matter of convention (probit is traditional in labor economics, logit in epidemiology).
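The similarity is quantifiable: the logistic CDF is well approximated by a normal CDF with the index divided by roughly 1.6, which is why logit coefficients tend to be about 1.6 times their probit counterparts. A quick check using only SciPy:

```python
import numpy as np
from scipy.stats import norm

xb = np.linspace(-4, 4, 81)
logistic = 1 / (1 + np.exp(-xb))        # logit link
probit_rescaled = norm.cdf(xb / 1.6)    # probit link with rule-of-thumb rescaling

gap = np.abs(logistic - probit_rescaled).max()
print(f"max |difference| over [-4, 4]: {gap:.4f}")  # under 0.02
```

On the same data the two models therefore report coefficients that differ mainly by this scale factor, while their marginal effects are nearly identical.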

Interpreting the output

Raw logit/probit coefficients are in log-odds (logit) or z-scores (probit), which are hard to interpret. Always compute marginal effects—the change in probability for a unit change in X. Average marginal effects (AME) average across all observations; marginal effects at the mean (MEM) evaluate at sample means. AME is generally preferred as it doesn't depend on a potentially unrepresentative "average" person.

# Python: Logit and Probit
import statsmodels.formula.api as smf

# Logit model (binary outcome)
logit = smf.logit('hired ~ education + experience + age', data=df).fit()
print(logit.summary())

# Probit model
probit = smf.probit('hired ~ education + experience + age', data=df).fit()

# Marginal effects (more interpretable)
margeff = logit.get_margeff()
print(margeff.summary())

# Marginal effects at the mean (MEM)
margeff_at_means = logit.get_margeff(at='mean')
print(margeff_at_means.summary())
* Stata: Logit and Probit

* Logit
logit hired education experience age

* Probit
probit hired education experience age

* Marginal effects (average marginal effects)
margins, dydx(*)

* Marginal effects at specific values
margins, dydx(*) at(education=16 experience=5)

* Odds ratios (for logit)
logit hired education experience age, or
# R: Logit and Probit

# Logit
logit <- glm(hired ~ education + experience + age,
             data = df,
             family = binomial(link = "logit"))
summary(logit)

# Probit
probit <- glm(hired ~ education + experience + age,
              data = df,
              family = binomial(link = "probit"))

# Marginal effects with margins package
library(margins)
margeff <- margins(logit)
summary(margeff)

# Odds ratios
exp(coef(logit))
Python Output
Optimization terminated successfully.
         Current function value: 0.512345
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:                  hired   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      996
Method:                           MLE   Df Model:                            3
Date:                Mon, 27 Jan 2026   Pseudo R-squ.:                  0.2187
Converged:                       True   Log-Likelihood:                -512.35
LLR p-value:                 1.23e-42   LL-Null:                       -655.78
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -4.2341      0.521     -8.127      0.000      -5.255      -3.213
education      0.3124      0.038      8.221      0.000       0.238       0.387
experience     0.1876      0.024      7.817      0.000       0.141       0.235
age           -0.0234      0.012     -1.950      0.051      -0.047       0.000
==============================================================================

        Logit Marginal Effects
=====================================
Dep. Variable:                  hired
Method:                          dydx
At:                           overall
==============================================================================
                dy/dx    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
education      0.0654      0.008      8.175      0.000       0.050       0.081
experience     0.0393      0.005      7.860      0.000       0.030       0.049
age           -0.0049      0.003     -1.633      0.102      -0.011       0.001
==============================================================================
Stata Output
. logit hired education experience age

Iteration 0:   log likelihood = -655.78123
Iteration 1:   log likelihood = -518.23456
Iteration 2:   log likelihood = -512.45678
Iteration 3:   log likelihood = -512.34567
Iteration 4:   log likelihood = -512.34561

Logistic regression                             Number of obs     =      1,000
                                                LR chi2(3)        =     286.87
                                                Prob > chi2       =     0.0000
Log likelihood = -512.34561                     Pseudo R2         =     0.2187

------------------------------------------------------------------------------
       hired |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .3124127   .0380123     8.22   0.000     .2379101    .3869153
  experience |   .1876234   .0240123     7.82   0.000     .1405601    .2346867
         age |  -.0234123   .0120123    -1.95   0.051    -.0469561    .0001315
       _cons |  -4.234127   .5210234    -8.13   0.000    -5.255315   -3.212939
------------------------------------------------------------------------------

. margins, dydx(*)

Average marginal effects                        Number of obs     =      1,000
Model VCE    : OIM

Expression   : Pr(hired), predict()
dy/dx w.r.t. : education experience age

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .0654231   .0080012     8.18   0.000     .0497411    .0811051
  experience |   .0392876   .0049987     7.86   0.000     .0294904    .0490848
         age |  -.0049012   .0030012    -1.63   0.102    -.0107834    .0009810
------------------------------------------------------------------------------

. logit hired education experience age, or

Logistic regression                             Number of obs     =      1,000
                                                LR chi2(3)        =     286.87
Log likelihood = -512.34561                     Pseudo R2         =     0.2187

------------------------------------------------------------------------------
       hired | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   1.366789   .0519432     8.22   0.000     1.268798    1.472756
  experience |   1.206423   .0289634     7.82   0.000     1.150923    1.264567
         age |   .9768543   .0117234    -1.95   0.051     .9541234    .9999876
       _cons |   .0145234   .0075645    -8.13   0.000     .0052345    .0402567
------------------------------------------------------------------------------
R Output
Call:
glm(formula = hired ~ education + experience + age, family = binomial(link = "logit"),
    data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3456  -0.8765   0.3456   0.7654   2.1234

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.234127   0.521023  -8.127  4.4e-16 ***
education    0.312413   0.038012   8.221  < 2e-16 ***
experience   0.187623   0.024012   7.817  5.4e-15 ***
age         -0.023412   0.012012  -1.950   0.0512 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1311.6  on 999  degrees of freedom
Residual deviance: 1024.7  on 996  degrees of freedom
AIC: 1032.7

Number of Fisher Scoring iterations: 4

 factor     AME     SE       z      p   lower   upper
 education  0.0654  0.0080   8.1750 0.0000  0.0497  0.0811
 experience 0.0393  0.0050   7.8600 0.0000  0.0295  0.0491
 age       -0.0049  0.0030  -1.6333 0.1024 -0.0108  0.0010

Odds ratios:
(Intercept)   education  experience         age
  0.0145234   1.3667894   1.2064231   0.9768543

Manual Marginal Effects for Probit/Logit

Raw logit/probit coefficients are not directly interpretable as the effect on the probability of the outcome. Because the link function is nonlinear, the marginal effect of a variable depends on the values of all covariates. There are three standard approaches to summarizing marginal effects:

  • Average Marginal Effect (AME): Compute the marginal effect for each observation, then average. For probit: AME_j = (1/N) Σ_i φ(x_i'β) · β_j. For logit: AME_j = (1/N) Σ_i Λ(x_i'β)(1 − Λ(x_i'β)) · β_j.
  • Marginal Effect at the Mean (MEM): Evaluate the marginal effect at the sample means of all covariates: φ(x̄'β) · β_j.
  • Marginal Effect at Representative values (MER): Evaluate at chosen covariate profiles (e.g., education = 16, experience = 5).
AME vs MEM vs MER

AME is generally preferred because it does not depend on a potentially unrepresentative "average" person (e.g., 2.3 children). MEM is convenient but can be misleading if the mean covariate profile does not describe anyone in the sample. MER is useful for policy questions ("what is the effect for a 30-year-old with 16 years of education?"). The three coincide only in a linear model; for logit and probit they typically differ.

# Python: Manual Marginal Effects for Probit
import statsmodels.api as sm
from scipy.stats import norm
import numpy as np

# Probit estimation
# Probit estimation
X = sm.add_constant(df[['education', 'experience', 'age']])
probit = sm.Probit(df['hired'], X)
result = probit.fit()

# Manual AME: mean of individual marginal effects
xb = np.asarray(X) @ result.params.values             # linear predictor X*beta
ame = norm.pdf(xb).mean() * result.params.values[1:]  # drop the constant
print("Manual AME:", ame)

# Automatic marginal effects (AME)
mfx = result.get_margeff(at='overall')
print(mfx.summary())

# Automatic marginal effects (MEM)
mfx_mean = result.get_margeff(at='mean')
print(mfx_mean.summary())
* Stata: Marginal Effects for Probit

* Probit estimation
probit hired education experience age

* Average Marginal Effects (AME) — default
margins, dydx(*)

* Marginal Effects at the Mean (MEM)
margins, dydx(*) atmeans

* Marginal effects at specific values (MER)
margins, dydx(education) at(experience=5 age=30)
# R: Manual Average Marginal Effect (AME) — Probit
probit <- glm(hired ~ education + experience + age,
              family = binomial(link = "probit"), data = df)
beta <- coef(probit)
xb <- predict(probit, type = "link")  # X * beta

# AME = mean of individual marginal effects
ame_education <- mean(dnorm(xb)) * beta["education"]
ame_experience <- mean(dnorm(xb)) * beta["experience"]
cat("AME (education):", round(ame_education, 4), "\n")

# Marginal Effect at the Mean (MEM)
x_bar <- colMeans(model.matrix(probit))
xb_bar <- sum(x_bar * beta)
mem_education <- dnorm(xb_bar) * beta["education"]
cat("MEM (education):", round(mem_education, 4), "\n")

# Using margins package (automatic AME)
library(margins)
m <- margins(probit)
summary(m)
Python Output
Optimization terminated successfully.
         Current function value: 0.498231
         Iterations 5

Manual AME: [0.0712  0.0421  -0.0053]

        Probit Marginal Effects (AME)
=====================================
Dep. Variable:                  hired
Method:                          dydx
At:                           overall
==============================================================================
                dy/dx    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
education      0.0712      0.009      7.911      0.000       0.054       0.089
experience     0.0421      0.005      8.420      0.000       0.032       0.052
age           -0.0053      0.003     -1.767      0.077      -0.011       0.001
==============================================================================

        Probit Marginal Effects (MEM)
=====================================
Dep. Variable:                  hired
Method:                          dydx
At:                              mean
==============================================================================
                dy/dx    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
education      0.0698      0.009      7.756      0.000       0.052       0.087
experience     0.0413      0.005      8.260      0.000       0.031       0.051
age           -0.0052      0.003     -1.733      0.083      -0.011       0.001
==============================================================================
Stata Output
. probit hired education experience age

Iteration 0:   log likelihood = -655.78123
Iteration 1:   log likelihood = -500.12345
Iteration 2:   log likelihood = -498.23456
Iteration 3:   log likelihood = -498.23100

Probit regression                               Number of obs     =      1,000
                                                LR chi2(3)        =     315.10
                                                Prob > chi2       =     0.0000
Log likelihood = -498.231                       Pseudo R2         =     0.2403

------------------------------------------------------------------------------
       hired |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .1912345   .0234567     8.15   0.000     .1452601    .2372089
  experience |   .1131234   .0141234     8.01   0.000     .0854421    .1408047
         age |  -.0142345   .0074567    -1.91   0.056    -.0288494    .0003804
       _cons |  -2.567890   .3201234    -8.02   0.000    -3.195321   -1.940459
------------------------------------------------------------------------------

. margins, dydx(*)

Average marginal effects                        Number of obs     =      1,000

Expression   : Pr(hired), predict()
dy/dx w.r.t. : education experience age

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .0712345   .0090123     7.91   0.000     .0535707    .0888983
  experience |   .0421234   .0050012     8.42   0.000     .0323212    .0519256
         age |  -.0053012   .0030012    -1.77   0.077    -.0111835    .0005811
------------------------------------------------------------------------------

. margins, dydx(*) atmeans

Conditional marginal effects                    Number of obs     =      1,000

Expression   : Pr(hired), predict()
dy/dx w.r.t. : education experience age
at           : education   =    14.23 (mean)
               experience  =     8.76 (mean)
               age         =    35.12 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .0698123   .0090012     7.76   0.000     .0521702    .0874544
  experience |   .0412876   .0049987     8.26   0.000     .0314904    .0510848
         age |  -.0051912   .0029987    -1.73   0.083    -.0110685    .0006861
------------------------------------------------------------------------------
R Output
AME (education): 0.0712

 factor     AME     SE       z      p   lower   upper
 education  0.0712  0.0090   7.9111 0.0000  0.0536  0.0889
 experience 0.0421  0.0050   8.4200 0.0000  0.0323  0.0519
 age       -0.0053  0.0030  -1.7667 0.0773 -0.0112  0.0006

MEM (education): 0.0698

Comparison of AME vs MEM:
              AME      MEM   Difference
education   0.0712   0.0698      0.0014
experience  0.0421   0.0413      0.0008
age        -0.0053  -0.0052      0.0001

Multinomial Logit

When the outcome is categorical with more than two unordered categories (e.g., occupation choice, transport mode, brand preference), we use the multinomial logit model. It estimates the probability of each category relative to a base category: log(P(Y=j) / P(Y=base)) = Xβj. The model produces one set of coefficients per non-base outcome. Coefficients are in terms of relative risk ratios (exponentiated) or log-odds relative to the base.
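
The mapping from coefficients to probabilities is a softmax over linear indices, with the base category's index normalized to zero. A minimal sketch, reusing the illustrative coefficient estimates shown in this section's output for a hypothetical individual (education 14, experience 8, age 35):

```python
import numpy as np

# Coefficients relative to base outcome 1 (illustrative values from this section)
b2 = np.array([-2.1234, 0.2345, 0.0876, -0.0123])  # const, education, experience, age
b3 = np.array([-3.4567, 0.3456, 0.1234, 0.0087])

x = np.array([1.0, 14.0, 8.0, 35.0])  # hypothetical individual (with constant)

u = np.array([0.0, x @ b2, x @ b3])   # base category's index is fixed at 0
probs = np.exp(u) / np.exp(u).sum()   # P(Y=1), P(Y=2), P(Y=3); sums to 1
print(probs)
```

This is exactly what `predict` computes for every observation: one linear index per non-base category, then the softmax.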

Independence of Irrelevant Alternatives (IIA)

Multinomial logit assumes that the ratio of probabilities between any two alternatives is independent of the other alternatives available (IIA property). This is the famous "red bus/blue bus" problem: adding a blue bus should not change the relative odds of taking a car vs. a red bus, but IIA forces it to. If IIA is implausible, consider nested logit, mixed logit, or multinomial probit instead. You can test IIA with the Hausman-McFadden test or the Small-Hsiao test.
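
The red bus/blue bus problem can be seen directly in the logit formula. A small sketch with made-up utilities (car and red bus assumed equally attractive):

```python
import numpy as np

def logit_probs(u):
    """Multinomial logit choice probabilities: softmax of utilities."""
    e = np.exp(u)
    return e / e.sum()

# Car vs. red bus, equal utility: 50/50 split
p2 = logit_probs(np.array([1.0, 1.0]))

# Add an identical blue bus: IIA keeps every pairwise ratio fixed,
# so all three alternatives get 1/3 -- car's share drops from 1/2 to 1/3,
# even though the blue bus should mostly cannibalize red-bus riders.
p3 = logit_probs(np.array([1.0, 1.0, 1.0]))

print(p2, p3)
print(p2[0] / p2[1], p3[0] / p3[1])  # car : red-bus odds unchanged (both 1.0)
```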

# Python: Multinomial Logit
import statsmodels.api as sm

# Multinomial logit using statsmodels
mlogit = sm.MNLogit(df['occupation'],
                    sm.add_constant(df[['education', 'experience', 'age']]))
result = mlogit.fit()
print(result.summary())

# Predicted probabilities for each category
pred_probs = result.predict()
print(pred_probs.head())

# Marginal effects
mfx = result.get_margeff()
print(mfx.summary())
* Stata: Multinomial Logit

* Multinomial logit with base outcome 1
mlogit occupation education experience age, baseoutcome(1)

* Relative risk ratios (exponentiated coefficients)
mlogit, rrr

* Predicted probabilities
predict p1 p2 p3, pr

* Marginal effects for a specific outcome
margins, dydx(*) predict(outcome(2))
# R: Multinomial Logit
library(nnet)

# Multinomial logit: choice among J alternatives
mlogit <- multinom(occupation ~ education + experience + age, data = df)
summary(mlogit)

# Predicted probabilities
pred_probs <- predict(mlogit, type = "probs")
head(pred_probs)

# Relative risk ratios (exponentiated coefficients)
exp(coef(mlogit))

# Marginal effects: dP_j/dx_k = P_j * (beta_jk - sum_m P_m * beta_mk)
library(margins)
m <- margins(mlogit)
summary(m)
# (the marginaleffects package is an alternative that also handles multinom objects)
Python Output
Optimization terminated successfully.

                          MNLogit Regression Results
==============================================================================
Dep. Variable:             occupation   No. Observations:                 1000
Model:                        MNLogit   Df Residuals:                      992
Method:                           MLE   Df Model:                            6
Date:                Mon, 27 Jan 2026   Pseudo R-squ.:                  0.1456
Log-Likelihood:                -876.54   LL-Null:                      -1025.8
LLR p-value:                 2.34e-61
==============================================================================
occupation=2       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const            -2.1234      0.412     -5.155      0.000      -2.931      -1.316
education         0.2345      0.031      7.565      0.000       0.174       0.295
experience        0.0876      0.021      4.171      0.000       0.047       0.129
age              -0.0123      0.009     -1.367      0.172      -0.030       0.005
------------------------------------------------------------------------------
occupation=3       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const            -3.4567      0.523     -6.610      0.000      -4.482      -2.432
education         0.3456      0.038      9.095      0.000       0.271       0.420
experience        0.1234      0.025      4.936      0.000       0.074       0.173
age               0.0087      0.011      0.791      0.429      -0.013       0.030
==============================================================================

Predicted probabilities (first 5 obs):
   occupation=1  occupation=2  occupation=3
0      0.4523        0.3214        0.2263
1      0.3876        0.3567        0.2557
2      0.5123        0.2876        0.2001
3      0.2987        0.3876        0.3137
4      0.4234        0.3123        0.2643
Stata Output
. mlogit occupation education experience age, baseoutcome(1)

Iteration 0:   log likelihood = -1025.8123
Iteration 1:   log likelihood =  -882.4567
Iteration 2:   log likelihood =  -876.5678
Iteration 3:   log likelihood =  -876.5432

Multinomial logistic regression                 Number of obs     =      1,000
                                                LR chi2(6)        =     298.54
                                                Prob > chi2       =     0.0000
Log likelihood = -876.5432                      Pseudo R2         =     0.1456

------------------------------------------------------------------------------
  occupation |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1            |  (base outcome)
-------------+----------------------------------------------------------------
2            |
   education |   .2345123   .0310234     7.57   0.000     .1737076    .2953170
  experience |   .0876234   .0210123     4.17   0.000     .0464399    .1288069
         age |  -.0123456   .0090234    -1.37   0.172    -.0300312    .0053400
       _cons |  -2.123456   .4120234    -5.16   0.000    -2.930987   -1.315925
-------------+----------------------------------------------------------------
3            |
   education |   .3456789   .0380123     9.10   0.000     .2711760    .4201818
  experience |   .1234567   .0250123     4.94   0.000     .0744334    .1724800
         age |   .0087654   .0110234     0.80   0.426    -.0128400    .0303708
       _cons |  -3.456789   .5230123    -6.61   0.000    -4.481874   -2.431704
------------------------------------------------------------------------------
R Output
Call:
multinom(formula = occupation ~ education + experience + age, data = df)

Coefficients:
  (Intercept)  education  experience       age
2   -2.123456  0.2345123   0.0876234 -0.012346
3   -3.456789  0.3456789   0.1234567  0.008765

Std. Errors:
  (Intercept)  education  experience       age
2   0.4120234  0.0310234   0.0210123 0.009023
3   0.5230123  0.0380123   0.0250123 0.011023

Residual Deviance: 1753.086
AIC: 1769.086

Relative risk ratios:
  (Intercept)  education  experience       age
2   0.1196234  1.2643567   1.0915234 0.987716
3   0.0316789  1.4130234   1.1313567 1.008804

Ordered Logit/Probit

When the outcome is ordinal—categories with a natural ordering but unknown spacing (e.g., satisfaction: 1=very dissatisfied to 5=very satisfied, credit ratings, self-reported health)—we use ordered logit (proportional odds model) or ordered probit. These models estimate a latent continuous variable Y* = Xβ + ε, with cutpoints α1 < α2 < … < αJ−1 that map the latent variable to observed categories. The key assumption is proportional odds: the effect of X is the same across all cutpoints.

Proportional Odds Assumption

The proportional odds (parallel lines) assumption means that the coefficient β shifts the entire latent distribution uniformly, regardless of which cutpoint you consider. Test it with the Brant test. If it fails, consider a generalized ordered logit (where coefficients vary by cutpoint) or a multinomial logit that does not impose ordering.
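
The cutpoint mechanics can be sketched in a few lines: P(Y ≤ j) = F(αj − Xβ), so category probabilities are differences of the link CDF. The cutpoints below are the illustrative /cut estimates from this section's output; the latent index value is made up:

```python
import numpy as np
from scipy.stats import logistic

cuts = np.array([-1.2346, -0.1235, 0.8765, 2.1234])  # alpha_1 < ... < alpha_4
xb = 0.5                                             # hypothetical latent index X @ beta

# P(Y <= j) = F(alpha_j - X @ beta); category probabilities are the differences
cdf = logistic.cdf(np.append(cuts - xb, np.inf))     # last CDF value is 1
probs = np.diff(np.insert(cdf, 0, 0.0))              # P(Y=1), ..., P(Y=5)
print(probs.round(4), probs.sum())
```

Note the single β: shifting xb moves all five probabilities through the same set of cutpoints, which is precisely the proportional odds restriction.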

# Python: Ordered Logit/Probit
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Ordered logit
ologit = OrderedModel(df['satisfaction'],
                      df[['income', 'education', 'age']],
                      distr='logit')
result = ologit.fit(method='bfgs')
print(result.summary())

# Ordered probit
oprobit = OrderedModel(df['satisfaction'],
                       df[['income', 'education', 'age']],
                       distr='probit')
result_p = oprobit.fit(method='bfgs')

# Predicted probabilities for each category
pred_probs = result.predict()
print(pred_probs.head())
* Stata: Ordered Logit/Probit

* Ordered logit
ologit satisfaction income education age

* Ordered probit
oprobit satisfaction income education age

* Predicted probabilities for each category
predict p1 p2 p3 p4 p5, pr

* Marginal effects for a specific outcome category
margins, dydx(*) predict(outcome(3))

* Brant test for proportional odds assumption
* (user-written: part of Long and Freese's spost package)
brant
# R: Ordered Logit/Probit
library(MASS)

# Ordered logit (proportional odds model)
ologit <- polr(factor(satisfaction) ~ income + education + age,
               data = df, method = "logistic")
summary(ologit)

# Ordered probit
oprobit <- polr(factor(satisfaction) ~ income + education + age,
                data = df, method = "probit")

# Predicted probabilities for each category
pred_probs <- predict(ologit, type = "probs")

# Brant test for proportional odds assumption
library(brant)
brant(ologit)
Python Output
Optimization terminated successfully.

                             OrderedModel Results
==============================================================================
Dep. Variable:           satisfaction   Log-Likelihood:                -1234.5
Model:                   OrderedModel   AIC:                            2483.0
Method:            Maximum Likelihood   BIC:                            2517.3
No. Observations:                1000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
income         0.0234      0.004      5.850      0.000       0.016       0.031
education      0.1567      0.028      5.596      0.000       0.102       0.212
age            0.0098      0.006      1.633      0.103      -0.002       0.022
1/2           -1.2345      0.234     -5.276      0.000      -1.693      -0.776
2/3           -0.1234      0.198     -0.623      0.533      -0.512       0.265
3/4            0.8765      0.201      4.361      0.000       0.483       1.270
4/5            2.1234      0.267      7.952      0.000       1.600       2.647
==============================================================================

Predicted probabilities (first 5 obs):
     P(Y=1)   P(Y=2)   P(Y=3)   P(Y=4)   P(Y=5)
0    0.0876   0.1543   0.2876   0.3123   0.1582
1    0.0654   0.1234   0.2567   0.3456   0.2089
2    0.1123   0.1876   0.3012   0.2765   0.1224
3    0.0432   0.0987   0.2234   0.3567   0.2780
4    0.0765   0.1456   0.2789   0.3234   0.1756
Stata Output
. ologit satisfaction income education age

Iteration 0:   log likelihood = -1456.7890
Iteration 1:   log likelihood = -1238.4567
Iteration 2:   log likelihood = -1234.5678
Iteration 3:   log likelihood = -1234.5432

Ordered logistic regression                     Number of obs     =      1,000
                                                LR chi2(3)        =     444.49
                                                Prob > chi2       =     0.0000
Log likelihood = -1234.5432                     Pseudo R2         =     0.1525

------------------------------------------------------------------------------
satisfaction |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .0234123   .0040012     5.85   0.000     .0155702    .0312544
   education |   .1567234   .0280123     5.60   0.000     .1018202    .2116266
         age |   .0098765   .0060456     1.63   0.103    -.0019726    .0217256
-------------+----------------------------------------------------------------
       /cut1 |  -1.234567   .2340234                     -1.693244   -.775890
       /cut2 |  -.1234567   .1980234                     -.511575     .264662
       /cut3 |   .8765432   .2010234                      .482545    1.270541
       /cut4 |   2.123456   .2670234                      1.600099    2.646813
------------------------------------------------------------------------------

. brant

Brant Test of Parallel Regression Assumption

                chi2     p>chi2        df
-----------------------------------------
All            8.234    0.2216         6
income         2.123    0.5467         2
education      3.456    0.1776         2
age            1.876    0.3912         2
-----------------------------------------

A significant test statistic provides evidence that the
parallel regression assumption has been violated.
R Output
Call:
polr(formula = factor(satisfaction) ~ income + education + age,
    data = df, method = "logistic")

Coefficients:
              Value Std. Error t value
income      0.02341   0.004001  5.8510
education   0.15672   0.028012  5.5950
age         0.00988   0.006046  1.6340

Intercepts:
    Value    Std. Error t value
1|2 -1.2346   0.2340    -5.2762
2|3 -0.1235   0.1980    -0.6234
3|4  0.8765   0.2010     4.3610
4|5  2.1234   0.2670     7.9520

Residual Deviance: 2469.09
AIC: 2483.09

--------------------------------------------
Brant Test (H0: Parallel Regression)
                   X2    df  probability
Omnibus Test    8.234     6     0.2216
income          2.123     2     0.5467
education       3.456     2     0.1776
age             1.876     2     0.3912
--------------------------------------------
No evidence against the parallel regression assumption (p > 0.05)

Bivariate Probit

The bivariate probit model jointly estimates two binary outcomes whose error terms may be correlated: Y1* = X1β1 + ε1 and Y2* = X2β2 + ε2, where corr(ε1, ε2) = ρ. Key applications include: (1) selection models where the outcome equation and selection equation are correlated, (2) IV estimation when both the endogenous variable and the outcome are binary, and (3) modeling any two inherently related binary decisions (e.g., labor force participation and home ownership).

When to Use Bivariate Probit vs. 2SLS

When you have a binary endogenous regressor and a binary outcome, standard 2SLS can still be used and is consistent (though not efficient) if the first stage is well-specified. However, bivariate probit is more efficient under correct specification because it models the joint distribution. The test of ρ = 0 serves as an endogeneity test. If ρ is not significantly different from zero, single-equation probit suffices.

# Python: Bivariate Probit via custom MLE
# No standard package — implement via maximum likelihood
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal  # scipy.stats.mvn was removed from SciPy

def biprobit_ll(params, X1, X2, y1, y2):
    k1, k2 = X1.shape[1], X2.shape[1]
    beta1 = params[:k1]
    beta2 = params[k1:k1 + k2]
    rho = np.tanh(params[-1])  # ensure -1 < rho < 1

    # Sign trick: with q = 2y - 1,
    # P(Y1=y1, Y2=y2) = Phi2(q1*X1'b1, q2*X2'b2; corr = q1*q2*rho)
    q1 = 2 * np.asarray(y1) - 1
    q2 = 2 * np.asarray(y2) - 1
    w1 = q1 * (X1 @ beta1)
    w2 = q2 * (X2 @ beta2)

    # Log-likelihood using the bivariate normal CDF
    ll = 0.0
    for i in range(len(w1)):
        r = q1[i] * q2[i] * rho
        prob = multivariate_normal.cdf([w1[i], w2[i]],
                                       mean=[0, 0],
                                       cov=[[1, r], [r, 1]])
        ll += np.log(max(prob, 1e-10))
    return -ll

# Minimize the negative log-likelihood, e.g.:
# minimize(biprobit_ll, start_vals, args=(X1, X2, y1, y2), method='BFGS')
# In practice, use Stata or R for bivariate probit
* Stata: Bivariate Probit

* Bivariate probit with exclusion restriction
biprobit (y1 = x1 x2 z1) (y2 = x1 x2)

* Test rho = 0 (exogeneity test)
* Reported automatically in output as Wald test of rho=0

* Predicted probabilities
predict p11, p11    /* Pr(Y1=1, Y2=1) */
predict p1,  pmarg1  /* Pr(Y1=1) marginal */
# R: Bivariate Probit using GJRM
library(GJRM)

# Specify two equations (with exclusion restriction in eq 1)
f.list <- list(y1 ~ x1 + x2 + z1,    # equation 1
               y2 ~ x1 + x2)           # equation 2

# Bivariate probit estimation
biprobit <- gjrm(f.list, data = df,
                 margins = c("probit", "probit"),
                 Model = "B")
summary(biprobit)

# Alternatively, using mvProbit (simulated ML)
# library(mvProbit)
# mvProbit(cbind(y1, y2) ~ x1 + x2, data = df)
Python Output
# Bivariate probit typically estimated in Stata or R.
# Python custom MLE output:
Optimization terminated successfully.

Estimated parameters:
  Equation 1 (y1):
    const:  -0.8765  (0.2345)
    x1:      0.4567  (0.0876)
    x2:      0.2345  (0.0654)
    z1:      0.3210  (0.0987)
  Equation 2 (y2):
    const:  -0.5432  (0.1987)
    x1:      0.3456  (0.0765)
    x2:      0.1876  (0.0543)

  rho:       0.3456  (0.0876)
  Wald test of rho=0: chi2(1) = 15.56, p = 0.0001
Stata Output
. biprobit (y1 = x1 x2 z1) (y2 = x1 x2)

Fitting comparison equation 1:
Iteration 0:   log likelihood = -567.8901
Iteration 1:   log likelihood = -523.4567

Fitting comparison equation 2:
Iteration 0:   log likelihood = -612.3456
Iteration 1:   log likelihood = -578.9012

Fitting full model:
Iteration 0:   log likelihood = -1098.7654
Iteration 1:   log likelihood = -1045.6789
Iteration 2:   log likelihood = -1042.3456
Iteration 3:   log likelihood = -1042.3412

Bivariate probit regression                     Number of obs     =      1,000

                                                Wald chi2(7)      =     187.65
Log likelihood = -1042.3412                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1           |
          x1 |   .4567123   .0876234     5.21   0.000     .2849736    .6284510
          x2 |   .2345678   .0654321     3.58   0.000     .1063233    .3628123
          z1 |   .3210456   .0987654     3.25   0.001     .1274690    .5146222
       _cons |  -.8765432   .2345678    -3.74   0.000    -1.336287   -.416799
-------------+----------------------------------------------------------------
y2           |
          x1 |   .3456789   .0765432     4.51   0.000     .1956570    .4957008
          x2 |   .1876543   .0543210     3.45   0.001     .0811870    .2941216
       _cons |  -.5432109   .1987654    -2.73   0.006    -.9327839   -.1536379
-------------+----------------------------------------------------------------
     /athrho |   .3598765   .0912345     3.95   0.000     .1810601    .5386929
-------------+----------------------------------------------------------------
         rho |   .3456123   .0812345                      .1793456    .4924567
------------------------------------------------------------------------------

Wald test of rho=0:                 chi2(1) =  15.5612   Prob > chi2 = 0.0001
R Output
GJRM Bivariate Probit Model

Equation 1: y1 ~ x1 + x2 + z1
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -0.876543    0.234568   -3.737   0.0002 ***
x1            0.456712    0.087623    5.213   0.0000 ***
x2            0.234568    0.065432    3.584   0.0003 ***
z1            0.321046    0.098765    3.250   0.0012 **

Equation 2: y2 ~ x1 + x2
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -0.543211    0.198765   -2.733   0.0063 **
x1            0.345679    0.076543    4.515   0.0000 ***
x2            0.187654    0.054321    3.454   0.0006 ***

Estimated rho: 0.3456  (SE: 0.0812)
Wald test of rho=0: chi2 = 15.56, p-value = 0.0001

n = 1000, Log-Lik = -1042.34

Control Function in Nonlinear Models

Standard 2SLS does not produce consistent estimates in nonlinear models (probit, logit, Poisson) because the endogeneity bias does not average out as it does in the linear case. The control function approach (Rivers & Vuong, 1988) provides a solution: (1) estimate a first-stage OLS regression of the endogenous variable on instruments and controls, (2) include the first-stage residuals as an additional regressor in the nonlinear second stage. The coefficient on the residuals serves as an endogeneity test: if significantly different from zero, the variable is endogenous.

Bootstrap Standard Errors Required

Naive standard errors from step 2 are incorrect because they ignore the estimation uncertainty from step 1 (the generated regressor problem). Always bootstrap the entire two-step procedure to obtain valid standard errors and confidence intervals. This means re-estimating both the first stage and second stage in each bootstrap iteration.

For the linear version of the control function approach, see Module 6D: Control Function.

# Python: Control Function Approach for Probit
import statsmodels.api as sm
import numpy as np
from sklearn.utils import resample

# Step 1: First-stage OLS
X_fs = sm.add_constant(df[['instrument', 'controls']])
fs = sm.OLS(df['endogenous_x'], X_fs).fit()
df['v_hat'] = fs.resid

# Step 2: Control function probit (include first-stage residuals)
X_cf = sm.add_constant(df[['endogenous_x', 'controls', 'v_hat']])
cf_probit = sm.Probit(df['y'], X_cf).fit()
print(cf_probit.summary())

# Test endogeneity: check significance of v_hat
print(f"v_hat coef: {cf_probit.params['v_hat']:.4f}")
print(f"v_hat p-value: {cf_probit.pvalues['v_hat']:.4f}")

# Step 3: Bootstrap for correct SEs
boot_coefs = []
for _ in range(999):
    boot_df = resample(df).reset_index(drop=True)  # fresh index keeps residual assignment aligned
    X_fs_b = sm.add_constant(boot_df[['instrument', 'controls']])
    fs_b = sm.OLS(boot_df['endogenous_x'], X_fs_b).fit()
    boot_df['v_hat'] = fs_b.resid
    X_cf_b = sm.add_constant(boot_df[['endogenous_x', 'controls', 'v_hat']])
    cf_b = sm.Probit(boot_df['y'], X_cf_b).fit(disp=0)
    boot_coefs.append(cf_b.params.values)
boot_se = np.std(boot_coefs, axis=0)
print("Bootstrap SEs:", boot_se)
* Stata: Control Function in Probit

* Step 1: First stage OLS
reg endogenous_x instrument controls
predict v_hat, resid

* Step 2: Probit with first-stage residuals
probit y endogenous_x controls v_hat

* Test endogeneity (is v_hat significant?)
test v_hat

* Step 3: Bootstrap for correct SEs
* Wrap both stages in a program so each replication re-runs the first stage
capture program drop probit_cf
program probit_cf
    regress endogenous_x instrument controls
    predict double v_hat_b, residuals
    probit y endogenous_x controls v_hat_b
    drop v_hat_b
end
bootstrap _b, reps(999): probit_cf

* Or use ivprobit directly (MLE-based)
ivprobit y controls (endogenous_x = instrument)
# R: Control Function Approach for Probit

# Step 1: First-stage OLS
first_stage <- lm(endogenous_x ~ instrument + controls, data = df)
df$v_hat <- residuals(first_stage)

# Step 2: Probit with residuals (Rivers-Vuong, 1988)
cf_probit <- glm(y ~ endogenous_x + controls + v_hat,
                 family = binomial(link = "probit"), data = df)
summary(cf_probit)

# Test endogeneity: is v_hat significant?
# H0: endogenous_x is exogenous (v_hat coef = 0)

# Step 3: Bootstrap for correct standard errors
library(boot)
cf_boot <- function(data, indices) {
  d <- data[indices, ]
  fs <- lm(endogenous_x ~ instrument + controls, data = d)
  d$v_hat <- residuals(fs)
  ss <- glm(y ~ endogenous_x + controls + v_hat,
            family = binomial(link = "probit"), data = d)
  coef(ss)
}
boot_results <- boot(df, cf_boot, R = 999)
# Bootstrap SEs
apply(boot_results$t, 2, sd)
Python Output
Optimization terminated successfully.
         Current function value: 0.467234
         Iterations 6

                     Control Function Probit Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                         Probit   Df Residuals:                      996
Method:                           MLE   Df Model:                            3
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2345      0.312     -3.957      0.000      -1.846      -0.623
endogenous_x   0.4567      0.098      4.660      0.000       0.265       0.649
controls       0.1234      0.045      2.742      0.006       0.035       0.212
v_hat         -0.2876      0.087     -3.306      0.001      -0.458      -0.117
==============================================================================

v_hat coef: -0.2876
v_hat p-value: 0.0009   (reject exogeneity)

Bootstrap SEs (999 replications):
  const: 0.3345   endogenous_x: 0.1123   controls: 0.0498   v_hat: 0.0956

Note: Naive SEs underestimate uncertainty. Bootstrap SEs are larger.
Stata Output
. reg endogenous_x instrument controls

      Source |       SS           df       MS      Number of obs   =     1,000
-------------+----------------------------------   F(2, 997)       =    123.45
       Model |  234.567890         2  117.283945   Prob > F        =    0.0000
    Residual |  876.543210       997    .879182    R-squared       =    0.2112
-------------+----------------------------------   Root MSE        =    .93762
       Total |  1111.11110       999  1.1122233

------------------------------------------------------------------------------
endogenous_x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  instrument |   .5678901   .0456789    12.43   0.000     .4782345    .6575457
    controls |   .2345678   .0345678     6.79   0.000     .1667543    .3023813
       _cons |   1.234567   .1234567    10.00   0.000     .9923456    1.476789
------------------------------------------------------------------------------

. probit y endogenous_x controls v_hat

Probit regression                               Number of obs     =      1,000
                                                Wald chi2(3)      =      67.89
                                                Prob > chi2       =     0.0000
Log likelihood = -467.2345                      Pseudo R2         =     0.2567

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
endogenous_x |   .4567234   .0980123     4.66   0.000     .2646228    .6488240
    controls |   .1234567   .0450123     2.74   0.006     .0352342    .2116792
       v_hat |  -.2876543   .0870234    -3.31   0.001    -.4582171   -.1170915
       _cons |  -1.234567   .3120345    -3.96   0.000    -1.846143   -.622991
------------------------------------------------------------------------------

. test v_hat
 ( 1)  [y]v_hat = 0
           chi2(  1) =   10.93
         Prob > chi2 =   0.0009     (reject exogeneity)

. ivprobit y controls (endogenous_x = instrument)

Probit model with endogenous regressors          Number of obs   =      1,000
                                                 Wald chi2(2)    =      45.67
Log likelihood = -467.8901                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
endogenous_x |   .4523456   .1123456     4.03   0.000     .2321523    .6725389
    controls |   .1198765   .0498765     2.40   0.016     .0221202    .2176328
       _cons |  -1.212345   .3345678    -3.62   0.000    -1.868086   -.556604
------------------------------------------------------------------------------

Wald test of exogeneity:  chi2(1) = 10.89   Prob > chi2 = 0.0010
R Output
Step 1 — First Stage:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.23457    0.12346  10.000   <2e-16 ***
instrument     0.56789    0.04568  12.432   <2e-16 ***
controls       0.23457    0.03457   6.786  1.8e-11 ***
---
F-statistic: 123.5 on 2 and 997 DF,  p-value: < 2.2e-16

Step 2 — Control Function Probit:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.234567   0.312035  -3.957  0.00008 ***
endogenous_x  0.456723   0.098012   4.660  3.2e-06 ***
controls      0.123457   0.045012   2.743  0.00609 **
v_hat        -0.287654   0.087023  -3.306  0.00095 ***

Endogeneity test (v_hat): z = -3.306, p = 0.00095
  => Reject H0 of exogeneity

Bootstrap Standard Errors (999 replications):
              Naive SE   Boot SE   Ratio
(Intercept)    0.3120     0.3345   1.072
endogenous_x   0.0980     0.1123   1.146
controls       0.0450     0.0498   1.107
v_hat          0.0870     0.0956   1.099

Poisson and Negative Binomial (Count Data)

Count outcomes (number of patents, doctor visits, accidents) require models that respect their non-negative, integer nature. The Poisson model assumes E[Y|X] = exp(Xβ) and Var(Y|X) = E[Y|X]—the mean equals the variance. This "equidispersion" assumption often fails in practice; when variance exceeds the mean (overdispersion), use negative binomial instead.
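
Overdispersion typically arises from unobserved heterogeneity: if the Poisson rate itself varies across units, the marginal variance exceeds the mean. A quick simulation (assumed gamma heterogeneity, which is exactly the negative binomial mixture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Pure Poisson: variance equals mean (equidispersion)
y_pois = rng.poisson(3.0, size=n)

# Gamma-mixed Poisson = negative binomial: variance exceeds mean
lam = rng.gamma(shape=2.0, scale=1.5, size=n)  # E[lam] = 3, Var[lam] = 4.5
y_nb = rng.poisson(lam)                        # Var = 3 + 4.5 = 7.5 > mean = 3

print(y_pois.mean(), y_pois.var())  # both near 3
print(y_nb.mean(), y_nb.var())      # mean near 3, variance near 7.5
```

Comparing the sample mean and variance of your count outcome is a useful first diagnostic before choosing between Poisson and negative binomial.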

Poisson Pseudo-MLE (PPML)

An important insight: Poisson regression with robust standard errors is consistent even if the data aren't Poisson-distributed, as long as the conditional mean is correctly specified. This "Poisson Pseudo-MLE" is widely used in trade economics for gravity equations (Santos Silva & Tenreyro, 2006) because it handles zeros naturally and is robust to heteroskedasticity.

# Python: Poisson Regression
import statsmodels.formula.api as smf

# Poisson (count data)
poisson = smf.poisson('num_patents ~ rd_spending + firm_size', data=df).fit()
print(poisson.summary())

# Negative binomial (allows overdispersion)
nb = smf.negativebinomial('num_patents ~ rd_spending + firm_size', data=df).fit()

# Robust Poisson (PPML for trade gravity equations)
poisson_robust = smf.poisson('trade ~ distance + gdp_origin + gdp_dest', data=df).fit(
    cov_type='HC1'
)
* Stata: Count Data Models

* Poisson
poisson num_patents rd_spending firm_size

* Negative binomial
nbreg num_patents rd_spending firm_size

* Poisson with robust/clustered SEs and fixed effects (PPML)
* (user-written command: ssc install ppmlhdfe)
ppmlhdfe trade distance gdp_origin gdp_dest, absorb(origin dest) cluster(pair)

* Check for overdispersion (run after -poisson-)
estat gof
# R: Count Data Models

# Poisson
poisson <- glm(num_patents ~ rd_spending + firm_size,
               data = df,
               family = poisson())
summary(poisson)

# Negative binomial
library(MASS)
nb <- glm.nb(num_patents ~ rd_spending + firm_size, data = df)
summary(nb)

# With fixest (fast, with fixed effects)
library(fixest)
poisson_fe <- fepois(num_patents ~ rd_spending | firm_id + year,
                     data = df)
Python Output
Optimization terminated successfully.
         Current function value: 2.345678
         Iterations 5
                          Poisson Regression Results
==============================================================================
Dep. Variable:            num_patents   No. Observations:                  800
Model:                        Poisson   Df Residuals:                      797
Method:                           MLE   Df Model:                            2
Date:                Mon, 27 Jan 2026   Pseudo R-squ.:                  0.1876
Converged:                       True   Log-Likelihood:                -1876.5
LLR p-value:                 3.45e-38   LL-Null:                       -2310.2
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4523      0.087      5.199      0.000       0.282       0.623
rd_spending    0.0234      0.003      7.800      0.000       0.018       0.029
firm_size      0.0876      0.012      7.300      0.000       0.064       0.111
==============================================================================

Interpretation: A one-unit increase in R&D spending is associated with
exp(0.0234) = 1.024 times as many patents (a 2.4% increase).

Mean of num_patents: 4.32
Variance of num_patents: 12.87
Dispersion ratio: 2.98 (evidence of overdispersion)
Stata Output
. poisson num_patents rd_spending firm_size

Iteration 0:   log likelihood = -2310.2345
Iteration 1:   log likelihood = -1889.4567
Iteration 2:   log likelihood = -1876.7890
Iteration 3:   log likelihood = -1876.5432
Iteration 4:   log likelihood = -1876.5432

Poisson regression                              Number of obs     =        800
                                                LR chi2(2)        =     867.38
                                                Prob > chi2       =     0.0000
Log likelihood = -1876.5432                     Pseudo R2         =     0.1876

------------------------------------------------------------------------------
 num_patents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 rd_spending |   .0234123   .0030012     7.80   0.000     .0175301    .0292945
   firm_size |   .0876234   .0120012     7.30   0.000     .0640816    .1111652
       _cons |   .4523456   .0870123     5.20   0.000     .2818036    .6228876
------------------------------------------------------------------------------

. estat gof

         Goodness-of-fit chi2  =  2345.678
         Prob > chi2(797)      =    0.0000

Note: Large chi2 suggests overdispersion. Consider negative binomial.

. nbreg num_patents rd_spending firm_size

Negative binomial regression                    Number of obs     =        800
                                                LR chi2(2)        =     234.56
Dispersion     = mean                           Prob > chi2       =     0.0000
Log likelihood = -1654.3210                     Pseudo R2         =     0.0662

------------------------------------------------------------------------------
 num_patents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 rd_spending |   .0198765   .0042345     4.70   0.000     .0115771    .0281759
   firm_size |   .0756432   .0165432     4.57   0.000     .0432191    .1080673
       _cons |   .5234567   .1123456     4.66   0.000     .3032615    .7436519
-------------+----------------------------------------------------------------
    /lnalpha |   .4567891   .0876543                      .2849888    .6285894
-------------+----------------------------------------------------------------
       alpha |   1.578912   .1384567                      1.329876    1.874987
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0: chibar2(01) = 444.44 Prob >= chibar2 = 0.000
R Output
Call:
glm(formula = num_patents ~ rd_spending + firm_size, family = poisson(),
    data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.4567  -1.2345  -0.2345   0.8765   4.5678

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.452346   0.087012   5.199 2.01e-07 ***
rd_spending  0.023412   0.003001   7.800 6.21e-15 ***
firm_size    0.087623   0.012001   7.300 2.87e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4620.5  on 799  degrees of freedom
Residual deviance: 3753.1  on 797  degrees of freedom
AIC: 3759.1

Overdispersion test: ratio = 2.98 (variance/mean)
Consider negative binomial regression.

Call: glm.nb(formula = num_patents ~ rd_spending + firm_size)

              Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.523457   0.112346   4.659 3.17e-06 ***
rd_spending   0.019877   0.004235   4.695 2.67e-06 ***
firm_size     0.075643   0.016543   4.573 4.80e-06 ***

Theta:  1.5789
Std. Err.:  0.1385

2 x log-likelihood:  -3304.642

7.5 MLE and GMM

Maximum Likelihood Estimation (MLE) and Generalized Method of Moments (GMM) are the two general frameworks underlying most econometric estimators. Understanding them helps you: (1) implement custom models not available in standard packages, (2) understand the properties of standard estimators, and (3) diagnose estimation problems.

Maximum Likelihood (MLE)

Find parameters that maximize the probability of observing the data.

  • Requires specifying the full distribution of Y|X
  • Efficient if the distribution is correct
  • May be inconsistent if the distribution is misspecified
  • Examples: Logit, Probit, Tobit, Poisson

Generalized Method of Moments (GMM)

Find parameters that make moment conditions hold in the sample.

  • Only requires specifying moment conditions E[g(Y, X, θ)] = 0
  • More robust but less efficient than MLE
  • Includes OLS and IV as special cases
  • Examples: IV/2SLS, panel GMM, difference GMM
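In the exactly-identified case, moment conditions can be solved directly rather than optimized. The classic example — estimating a mean and variance from E[X − μ] = 0 and E[(X − μ)² − σ²] = 0 — reduces to sample averages (toy data for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])  # toy sample

mu_hat = x.mean()                        # solves (1/n) * sum(x - mu) = 0
sigma2_hat = ((x - mu_hat) ** 2).mean()  # solves (1/n) * sum((x - mu)^2 - s2) = 0
print(mu_hat, sigma2_hat)  # → 3.5 5.25
```

With more moments than parameters the conditions cannot all hold exactly in the sample, which is where the GMM weight matrix and the J-test (covered below) come in.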

Custom MLE

Sometimes you need to estimate a model that isn't available in standard packages. The approach is straightforward: (1) write down the log-likelihood function, (2) use numerical optimization to find the maximum. The code below shows a simple example—linear regression estimated via MLE rather than OLS.

What to look for in the code

The key components are: (1) defining the negative log-likelihood function (negative because optimizers minimize by default), (2) choosing starting values, and (3) selecting an optimization algorithm (BFGS is a good default). Standard errors come from the inverse of the Hessian matrix at the optimum. For complex models, check that the optimizer converged and try multiple starting values.

# Python: Custom MLE
from scipy.optimize import minimize
import numpy as np

def neg_log_likelihood(params, X, y):
    """Negative log-likelihood for linear regression."""
    beta = params[:-1]
    sigma = params[-1]
    if sigma <= 0:
        return np.inf

    y_hat = X @ beta
    residuals = y - y_hat

    # Normal log-likelihood
    ll = -0.5 * len(y) * np.log(2 * np.pi * sigma**2)
    ll -= np.sum(residuals**2) / (2 * sigma**2)
    return -ll  # negative because we minimize

# Estimate
X = np.column_stack([np.ones(len(df)), df['x1'], df['x2']])
y = df['y'].values
initial = np.zeros(X.shape[1] + 1)
initial[-1] = 1  # initial sigma

result = minimize(neg_log_likelihood, initial, args=(X, y), method='BFGS')
print("MLE estimates:", result.x)
* Stata: Custom MLE with ml

* Define the log-likelihood function
program mylogit_ll
    args lnf xb
    quietly replace `lnf' = $ML_y1 * `xb' - ln(1 + exp(`xb'))
end

* Estimate
ml model lf mylogit_ll (outcome = x1 x2)
ml maximize
# R: Custom MLE

# Define negative log-likelihood
neg_ll <- function(params, X, y) {
  beta <- params[-length(params)]
  sigma <- params[length(params)]
  if (sigma <= 0) return(Inf)

  y_hat <- X %*% beta
  resid <- y - y_hat

  ll <- -0.5 * length(y) * log(2 * pi * sigma^2)
  ll <- ll - sum(resid^2) / (2 * sigma^2)
  return(-ll)
}

# Estimate
X <- cbind(1, df$x1, df$x2)
y <- df$y
initial <- c(rep(0, ncol(X)), 1)

result <- optim(initial, neg_ll, X = X, y = y, method = "BFGS", hessian = TRUE)
print(result$par)

# Standard errors from Hessian
se <- sqrt(diag(solve(result$hessian)))
Python Output
Optimization terminated successfully.
         Current function value: 567.234567
         Iterations: 23
         Function evaluations: 45
         Gradient evaluations: 45

MLE estimates: [ 2.34567123  1.87654321  0.54321098  1.23456789]

Parameter interpretation:
  beta_0 (intercept): 2.3457
  beta_1 (x1):        1.8765
  beta_2 (x2):        0.5432
  sigma:              1.2346

Comparison with OLS:
                   MLE          OLS
  Intercept      2.3457       2.3457
  x1             1.8765       1.8765
  x2             0.5432       0.5432
  Sigma/RMSE     1.2346       1.2346

Note: MLE and OLS give identical coefficient estimates for linear regression
with normally distributed errors. The sigma estimates differ only in the
denominator (MLE divides by n, OLS by n-k), making the OLS RMSE slightly
larger; here they agree to the displayed precision.
Stata Output
. ml model lf mylogit_ll (outcome = x1 x2)

. ml maximize

initial:       log likelihood = -345.67891
alternative:   log likelihood = -234.56789
rescale:       log likelihood = -189.01234
Iteration 0:   log likelihood = -189.01234
Iteration 1:   log likelihood = -156.78901
Iteration 2:   log likelihood = -145.67890
Iteration 3:   log likelihood = -145.23456
Iteration 4:   log likelihood = -145.23451
Iteration 5:   log likelihood = -145.23451

Maximum likelihood estimation                   Number of obs     =        500

                                                Wald chi2(2)      =     123.45
Log likelihood = -145.23451                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.234567   .1234567    10.00   0.000     .9925922    1.476542
          x2 |   .5678901   .0876543     6.48   0.000     .3960899    .7396903
       _cons |  -2.345678   .3456789    -6.79   0.000    -3.023197   -1.668159
------------------------------------------------------------------------------

Note: Custom MLE converged successfully.
Log-likelihood at maximum: -145.23451
R Output
$par
[1]  2.34567123  1.87654321  0.54321098  1.23456789

$value
[1] 567.2346

$counts
function gradient
      45       45

$convergence
[1] 0

$message
NULL

Parameter estimates:
  Intercept:  2.3457
  x1:         1.8765
  x2:         0.5432
  sigma:      1.2346

Standard errors (from Hessian):
  SE(Intercept): 0.2134
  SE(x1):        0.1023
  SE(x2):        0.0876
  SE(sigma):     0.0392

95% Confidence intervals:
              Estimate     SE    Lower    Upper
Intercept       2.3457 0.2134   1.9274   2.7640
x1              1.8765 0.1023   1.6760   2.0770
x2              0.5432 0.0876   0.3715   0.7149
sigma           1.2346 0.0392   1.1578   1.3114

Log-likelihood at maximum: -567.2346
AIC: 1142.469
BIC: 1159.234

Newton-Raphson Implementation

Newton-Raphson is the iterative algorithm behind most MLE estimators. Understanding it helps diagnose convergence issues. The update rule is: θ_{t+1} = θ_t − H(θ_t)^{−1} · g(θ_t), where g is the gradient (score) and H is the Hessian of the log-likelihood.

import numpy as np
from scipy.special import expit  # logistic function

def newton_raphson(X, y, tol=1e-8, max_iter=100):
    """Newton-Raphson for logistic regression."""
    beta = np.zeros(X.shape[1])

    for iteration in range(max_iter):
        # Predicted probabilities
        p = expit(X @ beta)

        # Score (gradient)
        score = X.T @ (y - p)

        # Hessian
        W = np.diag(p * (1 - p))
        hessian = -X.T @ W @ X

        # Update
        delta = np.linalg.solve(hessian, score)
        beta = beta - delta

        if np.max(np.abs(delta)) < tol:
            print(f"Converged in {iteration + 1} iterations")
            break

    se = np.sqrt(np.diag(np.linalg.inv(-hessian)))
    return {'coef': beta, 'se': se, 'iterations': iteration + 1}

# Apply
X = np.column_stack([np.ones(len(df)), df['education'],
                      df['experience'], df['age']])
y = df['hired'].values
result = newton_raphson(X, y)
* Stata uses Newton-Raphson by default for ml
* You can see iterations:
logit hired education experience age, iterate(100) trace

* Mata implementation of Newton-Raphson
mata:
void newton_raphson(string scalar yvar, string scalar xvars) {
    y = st_data(., yvar)
    X = st_data(., tokens(xvars))
    X = J(rows(X),1,1), X  // add constant

    beta = J(cols(X),1,0)  // starting values

    for (iter=1; iter<=100; iter++) {
        p = invlogit(X*beta)
        score = X'*(y - p)
        W = diag(p :* (1:-p))
        H = -X'*W*X
        delta = lusolve(H, score)
        beta = beta - delta
        if (max(abs(delta)) < 1e-8) break
    }

    se = sqrt(diagonal(luinv(-H)))
    beta, se
}
newton_raphson("hired", "education experience age")
end
# Newton-Raphson for logistic regression (manual implementation)
newton_raphson <- function(X, y, tol = 1e-8, max_iter = 100) {
  beta <- rep(0, ncol(X))  # starting values

  for (iter in 1:max_iter) {
    # Predicted probabilities
    p <- plogis(X %*% beta)  # 1 / (1 + exp(-Xb))

    # Score (gradient of log-likelihood)
    score <- t(X) %*% (y - p)

    # Hessian (second derivative of log-likelihood)
    W <- diag(as.vector(p * (1 - p)))
    hessian <- -t(X) %*% W %*% X

    # Newton-Raphson update: beta_new = beta - H^{-1} * score
    delta <- solve(hessian) %*% score
    beta <- beta - delta

    # Check convergence
    if (max(abs(delta)) < tol) {
      cat("Converged in", iter, "iterations\n")
      break
    }
  }

  # Standard errors from information matrix
  se <- sqrt(diag(solve(-hessian)))

  list(coefficients = as.vector(beta), se = se,
       iterations = iter, hessian = hessian)
}

# Apply to data
X <- cbind(1, df$education, df$experience, df$age)
y <- df$hired
nr_result <- newton_raphson(X, y)

# Compare with glm()
glm_result <- glm(hired ~ education + experience + age,
                   family = binomial, data = df)
cbind(NR = nr_result$coefficients, GLM = coef(glm_result))

# Alternative: nlm() — uses Newton-type algorithm
neg_ll <- function(beta, X, y) {
  p <- plogis(X %*% beta)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
nlm_result <- nlm(neg_ll, rep(0, 4), X = X, y = y, hessian = TRUE)
Python Output
Converged in 6 iterations

Newton-Raphson Logit Estimates:
==============================================================================
                 coef         se          z      P>|z|
------------------------------------------------------------------------------
Intercept     -4.2341     0.5210     -8.127      0.000
education      0.3124     0.0380      8.221      0.000
experience     0.1876     0.0240      7.817      0.000
age           -0.0234     0.0120     -1.950      0.051
==============================================================================

Comparison with statsmodels logit():
                   NR          statsmodels
  Intercept    -4.2341         -4.2341
  education     0.3124          0.3124
  experience    0.1876          0.1876
  age          -0.0234         -0.0234

Note: Estimates match exactly — Newton-Raphson is the algorithm
statsmodels uses internally for logit/probit.
Stata Output
. logit hired education experience age, iterate(100) trace

Iteration 0:   log likelihood = -655.78123
               education:  beta =   0.00000000
               experience: beta =   0.00000000
               age:        beta =   0.00000000
               _cons:      beta =   0.00000000
Iteration 1:   log likelihood = -518.23456
               education:  beta =   0.23456789
               experience: beta =   0.14567890
               age:        beta =  -0.01567890
               _cons:      beta =  -3.12345678
Iteration 2:   log likelihood = -512.45678
Iteration 3:   log likelihood = -512.34567
Iteration 4:   log likelihood = -512.34561
Iteration 5:   log likelihood = -512.34561

Logistic regression                             Number of obs     =      1,000
                                                LR chi2(3)        =     286.87
                                                Prob > chi2       =     0.0000
Log likelihood = -512.34561                     Pseudo R2         =     0.2187

------------------------------------------------------------------------------
       hired |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .3124127   .0380123     8.22   0.000     .2379101    .3869153
  experience |   .1876234   .0240123     7.82   0.000     .1405601    .2346867
         age |  -.0234123   .0120123    -1.95   0.051    -.0469561    .0001315
       _cons |  -4.234127   .5210234    -8.13   0.000    -5.255315   -3.212939
------------------------------------------------------------------------------

. mata:
: newton_raphson("hired", "education experience age")

Converged in 5 iterations.

               beta          se
    +--------------------------+
  1 |   .3124127    .0380123   |
  2 |   .1876234    .0240123   |
  3 |  -.0234123    .0120123   |
  4 |  -4.234127    .5210234   |
    +--------------------------+

: end
R Output
Converged in 6 iterations

$coefficients
[1] -4.23412700  0.31241270  0.18762340 -0.02341230

$se
[1] 0.52102340 0.03801230 0.02401230 0.01201230

$iterations
[1] 6

Comparison with glm():
              NR          GLM
Intercept  -4.234127   -4.234127
education   0.312413    0.312413
experience  0.187623    0.187623
age        -0.023412   -0.023412

Note: Manual NR and glm() produce identical estimates.
Diagnosing Convergence Failures

When optim(method='BFGS') or nlm() fail to converge, understanding Newton-Raphson helps you diagnose the problem: bad starting values, a flat likelihood, or a numerically unstable Hessian. Try: (1) better starting values (e.g., from a simpler model), (2) a different algorithm (Nelder-Mead is derivative-free; BFGS builds up a Hessian approximation rather than inverting the exact one), (3) rescaling variables so they are of comparable magnitude.
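A cheap defense against a bad start is to run the optimizer from several points and keep the best converged result. A minimal multi-start sketch (the wrapper name and toy objective are illustrative, not from any library):

```python
import numpy as np
from scipy.optimize import minimize

def multi_start(fun, starts):
    """Minimize from several starting points; keep the best run that converged."""
    best = None
    for x0 in starts:
        res = minimize(fun, x0, method="BFGS")
        if res.success and (best is None or res.fun < best.fun):
            best = res
    return best

# toy objective with a unique minimum at (1, -2)
f = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
best = multi_start(f, [np.zeros(2), np.array([50.0, -50.0])])
print(np.round(best.x, 3))
```

For a multimodal likelihood the same wrapper also reveals whether different starts land on different local optima, which is itself a useful diagnostic.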

Nonlinear Least Squares

NLS minimizes the sum of squared residuals for nonlinear functions. Unlike OLS, it requires starting values and iterative optimization. This makes it suitable for models like the CES production function, exponential growth, or any specification that cannot be linearized.

from scipy.optimize import curve_fit
import numpy as np

# Define nonlinear function
def ces_production(X, a, alpha, rho):
    K, L = X
    return a * (alpha * K**rho + (1 - alpha) * L**rho)**(1/rho)

# Fit NLS
popt, pcov = curve_fit(ces_production,
                        (df['capital'], df['labor']),
                        df['output'],
                        p0=[1, 0.5, -0.5])
print(f"a={popt[0]:.4f}, alpha={popt[1]:.4f}, rho={popt[2]:.4f}")
print(f"SEs: {np.sqrt(np.diag(pcov))}")

# Exponential growth
def exp_growth(x, a, r):
    return a * np.exp(r * x)

popt2, pcov2 = curve_fit(exp_growth, df['year'], df['gdp'], p0=[100, 0.02])
* Nonlinear least squares
nl (output = {a} * ({alpha} * capital^{rho} + ///
    (1-{alpha}) * labor^{rho})^(1/{rho})), ///
    initial(a 1 alpha 0.5 rho -0.5)

* Simpler exponential growth
nl (gdp = {a} * exp({r} * year)), initial(a 100 r 0.02)

* Display results
nlcom (a: _b[/a]) (r: _b[/r])
# Nonlinear Least Squares
# Example: CES production function Y = A * (alpha*K^rho + (1-alpha)*L^rho)^(1/rho)

# NLS for truly nonlinear model
nls_fit <- nls(output ~ a * (alpha * capital^rho + (1 - alpha) * labor^rho)^(1/rho),
               data = df,
               start = list(a = 1, alpha = 0.5, rho = -0.5),
               control = nls.control(maxiter = 200))
summary(nls_fit)

# Simpler example: exponential growth
nls_growth <- nls(gdp ~ a * exp(r * year), data = df,
                  start = list(a = 100, r = 0.02))
summary(nls_growth)
coef(nls_growth)
confint(nls_growth)  # profile likelihood CIs

# Residual diagnostics
plot(residuals(nls_fit) ~ fitted(nls_fit))
Python Output
CES Production Function — NLS Estimates:
a=1.2345, alpha=0.4567, rho=-0.3210
SEs: [0.0876 0.0345 0.0543]

Interpretation:
  A (total factor productivity): 1.2345
  alpha (capital share):         0.4567
  rho (substitution parameter): -0.3210
  Elasticity of substitution:   1/(1-rho) = 1/(1+0.321) = 0.757

Exponential Growth — NLS Estimates:
  a (initial level): 102.3456
  r (growth rate):     0.0234  (2.34% annual growth)
Stata Output
. nl (output = {a} * ({alpha} * capital^{rho} + (1-{alpha}) * labor^{rho})^(1/{rho})), ///
>     initial(a 1 alpha 0.5 rho -0.5)
(obs = 500)

Iteration 0:  residual SS =  12345.678
Iteration 1:  residual SS =   8765.432
Iteration 2:  residual SS =   6543.210
Iteration 3:  residual SS =   5678.901
Iteration 4:  residual SS =   5654.321
Iteration 5:  residual SS =   5654.123

      Source |       SS           df       MS
-------------+----------------------------------
       Model |  34567.890         3  11522.630    Number of obs   =       500
    Residual |   5654.123       497     11.376    R-squared       =    0.8594
-------------+----------------------------------   Adj R-squared   =    0.8586
       Total |  40222.013       500     80.444    Root MSE        =    3.3729

------------------------------------------------------------------------------
      output |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          /a |   1.234567   .0876543    14.09   0.000     1.062605    1.406529
      /alpha |   .4567890   .0345678    13.22   0.000     .3889025    .5246755
        /rho |  -.3210123   .0543210    -5.91   0.000    -.4276874   -.2143372
------------------------------------------------------------------------------

. nl (gdp = {a} * exp({r} * year)), initial(a 100 r 0.02)

      Source |       SS           df       MS
-------------+----------------------------------
       Model |  9876543.2         2  4938271.6    Number of obs   =        50
    Residual |    12345.7        48      257.2    R-squared       =    0.9988
-------------+----------------------------------   Root MSE        =    16.037
       Total |  9888888.9        50  197777.8

------------------------------------------------------------------------------
         gdp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          /a |   102.3456   4.567890    22.40   0.000     93.16123   111.52997
          /r |    .023412   .001234    18.97   0.000      .020930     .025894
------------------------------------------------------------------------------
R Output
Formula: output ~ a * (alpha * capital^rho + (1 - alpha) * labor^rho)^(1/rho)

Parameters:
      Estimate Std. Error t value Pr(>|t|)
a      1.23457    0.08765  14.085  < 2e-16 ***
alpha  0.45679    0.03457  13.214  < 2e-16 ***
rho   -0.32101    0.05432  -5.909 5.87e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.373 on 497 degrees of freedom

Number of iterations to convergence: 5
Achieved convergence tolerance: 4.567e-07

Exponential growth model:
Formula: gdp ~ a * exp(r * year)

Parameters:
  Estimate Std. Error t value Pr(>|t|)
a 102.3456     4.5679  22.404  < 2e-16 ***
r   0.0234     0.0012  18.969  < 2e-16 ***

Waiting for profiling to be done...
         2.5%      97.5%
a   93.234567 111.567890
r    0.020934   0.025912
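The substitution elasticity reported in the Python output follows directly from the rho estimate: for the CES function, sigma = 1/(1 − rho):

```python
rho = -0.3210                # NLS estimate of the substitution parameter
sigma = 1.0 / (1.0 - rho)    # elasticity of substitution
print(round(sigma, 3))       # → 0.757
```

A negative rho implies sigma below 1 (capital and labor are gross complements); rho = 0 recovers Cobb-Douglas (sigma = 1).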
Starting Values Matter

NLS is sensitive to starting values. If estimation fails, try: (1) grid search over the parameter space, (2) starting from OLS on a linearized version of the model, (3) simulated annealing for a global search. R's nls() uses Gauss-Newton by default, which stops with a "singular gradient" error when the Jacobian is (near-)rank-deficient at the starting values.
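Option (2) — starting values from OLS on a linearized model — looks like this for the exponential growth example (sketched in Python with synthetic data; in practice substitute your own year and gdp series):

```python
import numpy as np

# synthetic gdp = a * exp(r * year) with a = 100, r = 0.02
year = np.arange(20, dtype=float)
gdp = 100.0 * np.exp(0.02 * year)

# log-linearize: log(gdp) = log(a) + r * year, then fit by OLS
r0, log_a0 = np.polyfit(year, np.log(gdp), 1)
a0 = np.exp(log_a0)
print(round(a0, 1), round(r0, 3))  # → 100.0 0.02
```

Feeding (a0, r0) to the nonlinear optimizer usually gives convergence in a handful of iterations; the log-linearized fit itself is not the final answer because logging the outcome changes the error structure.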

GMM Estimation

Generalized Method of Moments (GMM) defines moment conditions E[g(Y, X, θ)] = 0. When you have more moments than parameters (overidentification), GMM finds the optimal combination. OLS and IV are special cases of GMM.

# GMM via linearmodels or manual implementation
from linearmodels.iv import IV2SLS
import numpy as np

# IV-GMM using linearmodels
iv_gmm = IV2SLS(df['y'], df[['exo_x']], df[['endo_x']],
                df[['z1', 'z2']]).fit(cov_type='robust')
print(iv_gmm.summary)

# Manual two-step GMM: build the data matrices first (constant included)
Y = df['y'].values
X = np.column_stack([np.ones(len(df)), df['endo_x'], df['exo_x']])
Z = np.column_stack([np.ones(len(df)), df['exo_x'], df['z1'], df['z2']])

def gmm_objective(theta, Y, X, Z, W):
    e = Y - X @ theta
    g_bar = Z.T @ e / len(Y)  # sample moment conditions
    return g_bar.T @ W @ g_bar

# Step 1: Identity weight matrix
from scipy.optimize import minimize
W1 = np.eye(Z.shape[1])
res1 = minimize(gmm_objective, x0=np.zeros(X.shape[1]),
                args=(Y, X, Z, W1))

# Step 2: Optimal weight matrix
e1 = Y - X @ res1.x
S = (Z * e1[:, None]).T @ (Z * e1[:, None]) / len(Y)
W2 = np.linalg.inv(S)
res2 = minimize(gmm_objective, x0=res1.x,
                args=(Y, X, Z, W2))
print("Two-step GMM:", res2.x)
* GMM estimation — IV example
gmm (y - {b0} - {b1}*endo_x - {b2}*exo_x), ///
    instruments(exo_x z1 z2) ///
    twostep winitial(unadjusted, independent)

* With HAC standard errors
gmm (y - {b0} - {b1}*endo_x - {b2}*exo_x), ///
    instruments(exo_x z1 z2) ///
    twostep vce(hac nwest 3)

* Compare with ivreg2
ivreg2 y exo_x (endo_x = z1 z2), gmm2s robust

* Hansen J-test reported automatically
library(gmm)

# Example 1: IV estimation via GMM
# Moment conditions: E[Z * (Y - X*beta)] = 0
# With more instruments than endogenous variables

# Define moment function
g <- function(theta, x) {
  # x columns: Y, endogenous_X, exogenous_X, instrument1, instrument2
  Y <- x[, 1]
  X <- cbind(1, x[, 2:3])  # constant, endogenous, exogenous
  Z <- cbind(1, x[, 3:5])  # constant, exogenous, instruments

  # Moment conditions: Z' * (Y - X*beta) = 0
  e <- Y - X %*% theta
  Z * as.vector(e)  # each column is a moment condition
}

# GMM estimation
data_mat <- as.matrix(df[, c("Y", "endo_x", "exo_x", "z1", "z2")])
gmm_result <- gmm(g, x = data_mat, t0 = c(0, 0, 0),
                   type = "twoStep",
                   vcov = "HAC")
summary(gmm_result)

# J-test for overidentification
# Reported automatically in summary — should NOT reject H0

# Example 2: Simple method of moments
# Estimate mean and variance using moment conditions
g_simple <- function(theta, x) {
  m1 <- x - theta[1]                  # E[X - mu] = 0
  m2 <- (x - theta[1])^2 - theta[2]  # E[(X-mu)^2 - sigma2] = 0
  cbind(m1, m2)
}
gmm_simple <- gmm(g_simple, x = df$Y, t0 = c(0, 1))
summary(gmm_simple)
Python Output
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:                      y   R-squared:                       0.2345
Estimator:                    IV-2SLS   Adj. R-squared:                  0.2314
No. Observations:                 500   F-statistic:                     67.89
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
exo_x          0.4567      0.087      5.250      0.000       0.286       0.627
endo_x         1.2345      0.234      5.276      0.000       0.776       1.693
==============================================================================
Endogeneity test:   8.765  (p = 0.003)
Overidentification (J-stat): 1.234  (p = 0.267)
  H0: All instruments are valid — do not reject.

Two-step GMM (manual):
  Step 1 (identity W):  [0.4321, 1.1876, 0.4512]
  Step 2 (optimal W):   [0.4567, 1.2345, 0.4567]

  Two-step GMM: [0.4567 1.2345 0.4567]
Stata Output
. gmm (y - {b0} - {b1}*endo_x - {b2}*exo_x), ///
>     instruments(exo_x z1 z2) ///
>     twostep winitial(unadjusted, independent)

Step 1
Iteration 0:   GMM criterion Q(b) =  .12345678
Iteration 1:   GMM criterion Q(b) =  .00234567
Iteration 2:   GMM criterion Q(b) =  .00012345

Step 2
Iteration 0:   GMM criterion Q(b) =  .00234567
Iteration 1:   GMM criterion Q(b) =  .00001234
Iteration 2:   GMM criterion Q(b) =  .00001234

GMM estimation
Number of parameters =   3
Number of moments    =   4
Initial weight matrix: Unadjusted                 Number of obs     =        500

                                                   (Std. Err. adjusted for
                                                    2 steps)
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         /b0 |   .4567890   .2345678     1.95   0.052    -.0029554    .9165334
         /b1 |   1.234567   .2340123     5.28   0.000     .7759108    1.693223
         /b2 |   .4567123   .0876543     5.21   0.000     .2849120    .6285126
------------------------------------------------------------------------------
Instruments for equation 1: exo_x z1 z2 _cons

Hansen's J chi2(1) = 1.23456  (p = 0.2665)
  H0: All instruments are valid — do not reject.

. ivreg2 y exo_x (endo_x = z1 z2), gmm2s robust

                                                   Number of obs     =        500
                                                   Wald chi2(2)      =     135.67
                                                   Prob > chi2       =     0.0000
                                                   R-squared         =     0.2345

                                Robust
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      endo_x |   1.234567   .2340123     5.28   0.000     .7759108    1.693223
       exo_x |   .4567123   .0876543     5.21   0.000     .2849120    .6285126
       _cons |   .4567890   .2345678     1.95   0.052    -.0029554    .9165334
------------------------------------------------------------------------------
Hansen J statistic:  1.235  Chi-sq(1) P-val = 0.2665
R Output
Call:
gmm(g = g, x = data_mat, t0 = c(0, 0, 0), type = "twoStep",
    vcov = "HAC")


Method:  twoStep

Coefficients:
          Estimate     Std. Error   t value      Pr(>|t|)
Theta[1]   0.4567890   0.2345678    1.9472       0.0515
Theta[2]   1.2345670   0.2340123    5.2757       1.33e-07  ***
Theta[3]   0.4567123   0.0876543    5.2105       1.89e-07  ***

J-Test: degrees of freedom is 1
                J-test    P-value
Test E(g)=0:   1.23456   0.26654

Note: We fail to reject H0; the test finds no evidence against instrument validity (though it cannot prove validity).

Simple Method of Moments (mean and variance):
Coefficients:
          Estimate    Std. Error   t value     Pr(>|t|)
Theta[1]   4.567890   0.098765    46.2456     < 2e-16  ***
Theta[2]   2.345678   0.187654    12.4987     < 2e-16  ***

(Mean estimate matches sample mean; variance matches sample variance)

MLE

  • Requires full distributional assumption (e.g., Y ~ Normal)
  • Most efficient estimator if the distribution is correctly specified
  • Inconsistent if the distribution is misspecified
  • Examples: Logit, Probit, Tobit, Poisson

GMM

  • Only requires moment conditions E[g(Y, X, θ)] = 0
  • Less efficient than correctly-specified MLE
  • More robust: consistent even without full distributional assumptions
  • Includes OLS and IV as special cases
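Two of the bullets above can be verified in a few lines: in the exactly-identified case, method-of-moments estimates of a mean and variance are just the sample moments, and OLS is GMM with moment conditions E[X'(y − Xβ)] = 0. A minimal sketch on synthetic data (not an efficient GMM routine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Method of moments for (mu, sigma^2): set sample moments equal to the
# population moments E[y] = mu and E[(y - mu)^2] = sigma^2, then solve.
mu_mm = y.mean()
sigma2_mm = ((y - mu_mm) ** 2).mean()

# OLS as GMM: the sample moment conditions (1/n) X'(y - Xb) = 0 are
# linear in b, so solving them reproduces the OLS estimate exactly.
X = np.column_stack([np.ones(n), x])
b_mm = np.linalg.solve(X.T @ X / n, X.T @ y / n)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(b_mm, b_ols)  # identical up to numerical precision
```

This is why the "Simple Method of Moments" output above matches the sample mean and variance exactly: with as many moments as parameters, GMM has a closed-form solution and no overidentifying restrictions to test.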

Quantile Regression

OLS estimates the conditional mean E[Y|X]. Quantile regression estimates conditional quantiles Qτ[Y|X]. This is useful for studying heterogeneous effects across the distribution. LAD (Least Absolute Deviations) is quantile regression at τ = 0.5 (the median).
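The objective behind quantile regression is the "check" (pinball) loss rho_tau(u) = u * (tau - 1{u < 0}); minimizing its sum over a constant recovers the empirical tau-quantile. A quick numerical verification on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(size=101)  # odd n, so the median is exactly a data point

def check_loss(u, tau):
    """Pinball / check loss: u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

# Minimize the summed check loss over candidate constants; for tau = 0.5
# the loss reduces to 0.5 * |u|, so the minimizer is the sample median.
tau = 0.5
candidates = np.sort(y)
objective = [check_loss(y - c, tau).sum() for c in candidates]
best = candidates[int(np.argmin(objective))]
print(best, np.median(y))  # the two coincide
```

Replacing the constant with a linear index x'b gives quantile regression proper, which the library calls below solve by linear programming.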

import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

# Quantile regression
qr_50 = smf.quantreg('wage ~ education + experience + female', df).fit(q=0.5)
print(qr_50.summary())

# Multiple quantiles
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
results = {}
for q in quantiles:
    results[q] = smf.quantreg('wage ~ education + experience + female', df).fit(q=q)

# Fit OLS for comparison, then plot coefficients across quantiles
import matplotlib.pyplot as plt
ols = smf.ols('wage ~ education + experience + female', data=df).fit()
education_coefs = [results[q].params['education'] for q in quantiles]
education_ci_low = [results[q].conf_int().loc['education', 0] for q in quantiles]
education_ci_high = [results[q].conf_int().loc['education', 1] for q in quantiles]

plt.fill_between(quantiles, education_ci_low, education_ci_high, alpha=0.2)
plt.plot(quantiles, education_coefs, 'o-')
plt.axhline(y=ols.params['education'], color='r', linestyle='--', label='OLS')
plt.xlabel('Quantile'); plt.ylabel('Education coefficient')
plt.legend(); plt.show()
* Quantile regression
qreg wage education experience female, quantile(0.50)

* Multiple quantiles
sqreg wage education experience female, quantiles(25 50 75 90) reps(200)

* Bootstrap standard errors
bsqreg wage education experience female, quantile(0.50) reps(200)

* Interquantile range regression
iqreg wage education experience female, quantiles(25 75) reps(200)

* Plot coefficients across quantiles
grqreg education experience female, ci ols olsci
library(quantreg)

# Quantile regression at different quantiles
rq_25 <- rq(wage ~ education + experience + female, data = df, tau = 0.25)
rq_50 <- rq(wage ~ education + experience + female, data = df, tau = 0.50)
rq_75 <- rq(wage ~ education + experience + female, data = df, tau = 0.75)
rq_90 <- rq(wage ~ education + experience + female, data = df, tau = 0.90)

summary(rq_50)  # LAD / median regression
summary(rq_50, se = "boot")  # bootstrap standard errors

# Multiple quantiles at once
rq_all <- rq(wage ~ education + experience + female, data = df,
             tau = seq(0.1, 0.9, by = 0.1))
plot(summary(rq_all))  # coefficient plots across quantiles

# Compare OLS vs quantile regression
ols <- lm(wage ~ education + experience + female, data = df)
cat("OLS education effect:", coef(ols)["education"], "\n")
cat("Q25 education effect:", coef(rq_25)["education"], "\n")
cat("Q75 education effect:", coef(rq_75)["education"], "\n")

# Test equality across quantiles (Wald test)
anova(rq_25, rq_75)  # are Q25 and Q75 coefficients equal?
Python Output
                         QuantReg Regression Results (q=0.50)
==============================================================================
Dep. Variable:                   wage   Pseudo R-squared:                0.2814
Model:                       QuantReg   Bandwidth:                       1.234
Method:                 Least Squares   Sparsity:                        3.456
No. Observations:                2000   Df Model:                            3
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.2345      0.342      3.610      0.000       0.564       1.905
education      0.8721      0.045     19.380      0.000       0.784       0.960
experience     0.0342      0.005      6.840      0.000       0.024       0.044
female        -2.1543      0.287     -7.508      0.000      -2.717      -1.591
==============================================================================

Education coefficient across quantiles:
  Q10: 0.6234    Q25: 0.7512    Q50: 0.8721    Q75: 1.0234    Q90: 1.2187
  OLS: 0.8945

Generated figures:

[Figure: education coefficient (shaded 95% CI band) plotted across quantiles 0.1-0.9, with the OLS coefficient as a dashed horizontal reference line.]
Stata Output
Simultaneous quantile regression                 Number of obs =      2,000
  bootstrap(200) SEs                              .25 Pseudo R2 =     0.2156
                                                  .50 Pseudo R2 =     0.2814
                                                  .75 Pseudo R2 =     0.2534
                                                  .90 Pseudo R2 =     0.2189

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
q25          |
   education |   .7512345   .0523456    14.35   0.000     .6485123    .8539567
  experience |   .0287654   .0056234     5.12   0.000     .0177321    .0397987
      female |  -1.876543   .3124567    -6.01   0.000    -2.489123   -1.263963
       _cons |   .8234567   .3765432     2.19   0.029     .0847123    1.562201
-------------+----------------------------------------------------------------
q50          |
   education |   .8721234   .0452345    19.28   0.000     .7833456    .9609012
  experience |   .0341876   .0048765     7.01   0.000     .0246123    .0437629
      female |  -2.154312   .2876543    -7.49   0.000    -2.718567   -1.590057
       _cons |  1.234567    .3421098     3.61   0.000     .5632456    1.905888
-------------+----------------------------------------------------------------
q75          |
   education |  1.023456    .0587654    17.41   0.000     .9081234    1.138789
  experience |   .0398765   .0061234     6.51   0.000     .0278654    .0518876
      female |  -2.432156   .3234567    -7.52   0.000    -3.066567   -1.797745
       _cons |   .5678901   .4012345     1.42   0.157    -.2190876    1.354868
------------------------------------------------------------------------------
(q90 results omitted)
R Output
Call: rq(formula = wage ~ education + experience + female, tau = 0.5, data = df)

Coefficients:
             Value    Std. Error t value  Pr(>|t|)
(Intercept)  1.2345   0.3420    3.6100   0.0003
education    0.8721   0.0452   19.2920   0.0000
experience   0.0342   0.0049    6.9800   0.0000
female      -2.1543   0.2877   -7.4890   0.0000

OLS education effect: 0.8945
Q25 education effect: 0.7512
Q75 education effect: 1.0235
Interpreting Quantile Regression

If the education coefficient increases across quantiles (e.g., larger at Q90 than Q10), education has a bigger wage effect for high earners, consistent with skill complementarity or superstar effects. If it decreases, education acts as an equalizer. OLS captures only the conditional average effect and misses this heterogeneity.

7.6 Bootstrap Methods

Bootstrap is a resampling method that estimates the sampling distribution of a statistic by repeatedly drawing samples (with replacement) from the original data. It's especially useful when:

  • Analytical formulas don't exist: Complex estimators, ratios, quantiles
  • Asymptotic approximations are poor: Small samples, heavy-tailed distributions
  • Testing with few clusters: Wild cluster bootstrap for reliable inference with 10-50 clusters
Types of bootstrap

Nonparametric: Resample observations with replacement (shown below). Works for most situations.

Cluster bootstrap: Resample entire clusters, not individual observations. Essential for clustered data.
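The cluster-resampling step is simple to write by hand; a sketch with hypothetical column names (a cluster drawn twice appears twice, preserving within-cluster correlation):

```python
import numpy as np
import pandas as pd

def cluster_resample(df, cluster_col, rng):
    """Draw whole clusters with replacement, keeping each cluster intact."""
    ids = df[cluster_col].unique()
    draw = rng.choice(ids, size=len(ids), replace=True)
    # Concatenate the chosen clusters; duplicated draws are duplicated blocks.
    return pd.concat([df[df[cluster_col] == g] for g in draw],
                     ignore_index=True)

rng = np.random.default_rng(0)
df = pd.DataFrame({"school_id": np.repeat(np.arange(10), 5),
                   "outcome": rng.normal(size=50)})
boot_df = cluster_resample(df, "school_id", rng)
print(len(boot_df))  # 50 here, since all clusters are equal-sized
```

Note that with unequal cluster sizes the resampled dataset generally differs in length from the original; that is expected and harmless.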

Wild cluster bootstrap: A variant that works well with few clusters (Cameron, Gelbach & Miller, 2008). Particularly important for difference-in-differences with few treated states/periods.
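The wild cluster bootstrap logic can be sketched in a few lines. This is a non-studentized restricted (null-imposed) variant on simulated data: flip the restricted residuals cluster-by-cluster with Rademacher weights, refit, and ask how extreme the observed slope is under the bootstrap null. In practice, prefer the studentized implementations in boottest (Stata) or fwildclusterboot (R):

```python
import numpy as np

# Simulated clustered data; names and parameter values are hypothetical
rng = np.random.default_rng(1)
G, m = 20, 30                                  # 20 clusters of 30 observations
cl = np.repeat(np.arange(G), m)
x = rng.normal(size=G * m) + rng.normal(size=G)[cl]   # within-cluster correlated
y = 0.5 * x + 0.3 * rng.normal(size=G)[cl] + rng.normal(size=G * m)

X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Impose the null beta_x = 0: the restricted model is intercept-only
resid0 = y - y.mean()

# Each replication flips whole clusters' restricted residuals with a
# single Rademacher (+1/-1) weight per cluster, then refits the slope
B = 999
boot = np.empty(B)
for b in range(B):
    w = rng.choice([-1.0, 1.0], size=G)[cl]
    y_star = y.mean() + w * resid0
    boot[b] = np.linalg.lstsq(X, y_star, rcond=None)[0][1]

# Symmetric p-value: share of bootstrap slopes at least as extreme
p = np.mean(np.abs(boot) >= abs(beta_hat))
print(f"beta_hat = {beta_hat:.3f}, wild cluster bootstrap p = {p:.3f}")
```

Because the weights are drawn per cluster rather than per observation, the within-cluster dependence structure is preserved in every replication.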

# Python: Bootstrap
import numpy as np
import statsmodels.formula.api as smf

def bootstrap_coef(df, formula, n_bootstrap=1000):
    """Bootstrap standard errors for regression coefficients."""
    n = len(df)
    coefs = []

    for _ in range(n_bootstrap):
        # Resample with replacement
        sample = df.sample(n, replace=True)
        model = smf.ols(formula, data=sample).fit()
        coefs.append(model.params.values)

    coefs = np.array(coefs)
    return {
        'mean': coefs.mean(axis=0),
        'se': coefs.std(axis=0),
        'ci_lower': np.percentile(coefs, 2.5, axis=0),
        'ci_upper': np.percentile(coefs, 97.5, axis=0)
    }

results = bootstrap_coef(df, 'y ~ x1 + x2')
print("Bootstrap SEs:", results['se'])
* Stata: Bootstrap

* Set seed for reproducibility (alternative to inline seed())
set seed 42

* Simple bootstrap
bootstrap _b, reps(1000) seed(12345): reg outcome treatment controls

* Post-estimation: display all bootstrap statistics
estat bootstrap, all

* Extract percentile confidence intervals
matrix list e(ci_percentile)

* Cluster bootstrap (for clustered data)
bootstrap _b, reps(1000) seed(12345) cluster(school_id): ///
    reg outcome treatment controls

* Equivalent: request the bootstrap through the vce() option
reg outcome treatment controls, vce(bootstrap, cluster(school_id) reps(1000))
# R: Bootstrap
library(boot)

# Function to extract coefficients
boot_fn <- function(data, indices) {
  d <- data[indices, ]
  model <- lm(y ~ x1 + x2, data = d)
  return(coef(model))
}

# Run bootstrap
results <- boot(df, boot_fn, R = 1000)
print(results)

# Confidence intervals
boot.ci(results, type = c("perc", "bca"), index = 2)

# Cluster bootstrap
library(fwildclusterboot)
model <- lm(outcome ~ treatment, data = df)
boottest(model, param = "treatment", clustid = ~ school_id, B = 999)
Bootstrap Pitfall: Factor Variables and Missing Levels

When bootstrapping models with factor (categorical) variables, some resampled datasets may not contain all levels of a factor. For example, if your model includes factor(region) with levels North, South, East, West, a bootstrap resample might contain no observations from "East." This causes problems:

  • The model matrix will have fewer dummy columns than the original model
  • Coefficient positions may shift, making extraction by name or index unreliable
  • Some bootstrap iterations may produce errors or NA values

Fix (R): Pre-define factor levels before the loop: df$region <- factor(df$region, levels = c("North", "South", "East", "West")). R will then create dummies for all levels even if a level is absent in the resample (the dummy column will be all zeros, and its coefficient will be NA). Extract coefficients by name, not position, and handle NAs gracefully.
Fix (Python): Use pd.Categorical(df['region'], categories=[...]) or use C(region) in the statsmodels formula, which handles this automatically.
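A quick pandas illustration of the missing-level problem and the Categorical fix, on toy data:

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "East", "West"] * 5})
resample = df[df["region"] != "East"]  # mimics a bootstrap draw with no "East"

# Plain strings: the dummy column for the absent level silently disappears
print(pd.get_dummies(resample["region"]).columns.tolist())
# ['North', 'South', 'West']

# Fix: lock in the levels first; the absent level keeps an all-zero column
levels = ["North", "South", "East", "West"]
resample = resample.assign(
    region=pd.Categorical(resample["region"], categories=levels))
dummies = pd.get_dummies(resample["region"])
print(dummies.columns.tolist())
# ['North', 'South', 'East', 'West']
print(int(dummies["East"].sum()))  # 0 -- column present but empty
```

With the column count stabilized across replications, coefficient extraction by name works in every bootstrap iteration.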

Python Output
Running bootstrap with 1000 replications...
[========================================] 100% Complete

Bootstrap Results (1000 replications):
======================================

                   OLS Est.   Bootstrap SE   95% CI Lower   95% CI Upper
Intercept            2.3457         0.2187         1.9234         2.7812
x1                   1.8765         0.1034         1.6723         2.0798
x2                   0.5432         0.0912         0.3645         0.7234

Comparison of Standard Errors:
                   OLS SE    Bootstrap SE    Ratio
Intercept          0.2134          0.2187    1.025
x1                 0.1023          0.1034    1.011
x2                 0.0876          0.0912    1.041

Note: Bootstrap SEs are slightly larger, suggesting mild heteroskedasticity.
Percentile confidence intervals are asymmetric, which is appropriate for
finite-sample inference.
Stata Output
. bootstrap _b, reps(1000) seed(12345): reg outcome treatment controls
(running regress on estimation sample)

Bootstrap replications (1000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
..................................................   150
..................................................   200
..................................................   250
..................................................   300
..................................................   350
..................................................   400
..................................................   450
..................................................   500
..................................................   550
..................................................   600
..................................................   650
..................................................   700
..................................................   750
..................................................   800
..................................................   850
..................................................   900
..................................................   950
..................................................  1000

Linear regression                               Number of obs     =        500
                                                Replications      =      1,000

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   1.872431   .1876543     9.98   0.000     1.504635    2.240227
    controls |   .4215328   .0998765     4.22   0.000     .2257782    .6172874
       _cons |   2.453124   .3234567     7.59   0.000     1.819161    3.087087
------------------------------------------------------------------------------

. bootstrap _b, reps(1000) seed(12345) cluster(school_id): ///
>     reg outcome treatment controls

Bootstrap replications (1000)
(output omitted)

Linear regression                               Number of obs     =      2,500
                                                Replications      =      1,000

                                  (Replications based on 50 clusters in school_id)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   2.154312   .2456789     8.77   0.000     1.672789    2.635835
    controls |   .3821456   .1234567     3.10   0.002     .1401729    .6241183
       _cons |   1.287645   .4876543     2.64   0.008     .3318595    2.243431
------------------------------------------------------------------------------
Note: Cluster bootstrap with 50 clusters
R Output
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = df, statistic = boot_fn, R = 1000)

Bootstrap Statistics :
      original       bias    std. error
t1*   2.345671  0.002345123   0.2187654
t2*   1.876543 -0.001234567   0.1034567
t3*   0.543211  0.000876543   0.0912345

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = c("perc", "bca"), index = 2)

Intervals :
Level     Percentile            BCa
95%   ( 1.672,  2.080 )   ( 1.668,  2.076 )
Calculations and Intervals on Original Scale

         estimate     se    2.5%   97.5%
x1         1.8765 0.1035  1.6723  2.0798

Wild Cluster Bootstrap (fwildclusterboot):
==========================================
boottest(model, param = "treatment", clustid = ~ school_id, B = 999)

 Estimate: 2.154312
 Std. Err. (clustered): 0.217384

 Wild Cluster Bootstrap Confidence Interval (Rademacher weights):
 95% CI: [1.7234, 2.5876]

 Bootstrap p-value: 0.000
 Number of clusters: 50
 Number of bootstrap iterations: 999

Note: Wild cluster bootstrap is recommended when number of clusters < 50.
Key References
  • Cameron, A.C. & Trivedi, P. (2005). Microeconometrics: Methods and Applications. Cambridge.
  • Wooldridge, J. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.
  • Cameron, A.C., Gelbach, J., & Miller, D. (2008). "Bootstrap-Based Improvements for Inference with Clustered Errors." Review of Economics and Statistics.
  • Cameron, A.C. & Miller, D. (2015). "A Practitioner's Guide to Cluster-Robust Inference." JHR.
  • Abadie, A., Athey, S., Imbens, G., & Wooldridge, J. (2023). "When Should You Adjust Standard Errors for Clustering?" QJE.