6B  Difference-in-Differences

~3 hours · Classic, Staggered, Event Studies

Classic 2x2 DiD

The simplest DiD compares changes in outcomes between a treatment group and control group before and after a policy change.
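Before any regression, the 2x2 estimate is just arithmetic on four group-period means: the change for the treated group minus the change for the control group. A quick sketch with made-up numbers (all values here are hypothetical):

```python
import pandas as pd

# Hypothetical group-period mean outcomes
means = pd.DataFrame({'pre':  [3.25, 3.37],
                      'post': [3.70, 4.40]},
                     index=['control', 'treatment'])

change_treat = means.loc['treatment', 'post'] - means.loc['treatment', 'pre']
change_ctrl = means.loc['control', 'post'] - means.loc['control', 'pre']
did = change_treat - change_ctrl

print(f'DiD = {change_treat:.2f} - {change_ctrl:.2f} = {did:.2f}')  # 1.03 - 0.45 = 0.58
```

This difference of differences is exactly the coefficient the interaction term recovers in the regression formulation below.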

# Python: Classic 2x2 DiD
import pandas as pd
import statsmodels.formula.api as smf

# Data structure: unit, time, treated_group, post, outcome
# treated_group = 1 if unit is in treatment group
# post = 1 if period is after treatment

# Create interaction term
df['treat_post'] = df['treated_group'] * df['post']

# DiD regression
model = smf.ols('outcome ~ treated_group + post + treat_post', data=df).fit()
print(model.summary())

# With clustered standard errors
model_clustered = smf.ols('outcome ~ treated_group + post + treat_post',
                          data=df).fit(cov_type='cluster',
                                       cov_kwds={'groups': df['unit_id']})
* Stata: Classic 2x2 DiD

* Basic DiD regression
reg outcome i.treated_group##i.post, cluster(unit_id)

* Equivalent with manually created interaction
gen treat_post = treated_group * post
reg outcome treated_group post treat_post, cluster(unit_id)

* With controls
reg outcome i.treated_group##i.post age income, cluster(unit_id)
# R: Classic 2x2 DiD
library(fixest)

# Basic DiD
did_model <- feols(outcome ~ treated_group * post,
                   data = df,
                   cluster = ~ unit_id)
summary(did_model)

# Using lm with sandwich standard errors
library(sandwich)
library(lmtest)
model <- lm(outcome ~ treated_group * post, data = df)
coeftest(model, vcov = vcovCL, cluster = ~ unit_id)
Python Output
                            OLS Regression Results
==============================================================================
Dep. Variable:                outcome   R-squared:                       0.312
Model:                            OLS   Adj. R-squared:                  0.309
Method:                 Least Squares   F-statistic:                     89.23
Date:                Mon, 27 Jan 2026   Prob (F-statistic):           1.23e-45
Time:                        14:32:18   Log-Likelihood:                -1234.5
No. Observations:                1000   AIC:                             2477.
Df Residuals:                     996   BIC:                             2497.
Df Model:                           3
==============================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept         3.2456      0.089     36.467      0.000       3.071       3.420
treated_group     0.1234      0.126      0.979      0.328      -0.124       0.371
post              0.4567      0.126      3.624      0.000       0.210       0.704
treat_post        0.5823      0.178      3.271      0.001       0.233       0.932
==============================================================================

DiD Estimate: 0.582 (SE: 0.178, p < 0.01)
Interpretation: Treatment increased outcome by 0.58 units
Stata Output
Linear regression                               Number of obs     =      1,000
                                                F(3, 49)          =      42.17
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3124
                                                Root MSE          =     1.2345

                                (Std. Err. adjusted for 50 clusters in unit_id)
------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.treated_~p |   .1234567   .1456789     0.85   0.401    -.1693456    .4162590
      1.post |   .4567891   .1234567     3.70   0.001     .2085234    .7050548
             |
 treated_g~p#|
        post |
        1 1  |   .5823456   .1789012     3.26   0.002     .2227890    .9419022
             |
       _cons |   3.245678   .1023456    31.71   0.000     3.039765    3.451591
------------------------------------------------------------------------------

DiD Estimate (1.treated_group#1.post): 0.582 (p = 0.002)
R Output
OLS estimation, Dep. Var.: outcome
Observations: 1,000
Standard-errors: Clustered (unit_id)
                   Estimate Std. Error t value  Pr(>|t|)
(Intercept)         3.24568    0.10235 31.7134   < 2e-16 ***
treated_group       0.12346    0.14568  0.8474    0.4012
post                0.45679    0.12346  3.6997    0.0005 ***
treated_group:post  0.58235    0.17890  3.2551    0.0020 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.234   Adj. R2: 0.309

DiD Estimate: 0.582   95% CI: [0.223, 0.942]
Interpretation: Treatment effect = 0.58 units (p = 0.002)

Two-Way Fixed Effects

With panel data, use unit and time fixed effects to control for time-invariant unit characteristics and common time shocks.
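What the two sets of fixed effects are doing can be seen directly: for a balanced panel, OLS with explicit unit and time dummies (LSDV) is numerically identical to first double-demeaning the outcome and treatment within units and periods, then running a simple regression. A small simulated check (all variable names and parameter values here are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated balanced panel with unit effects, time effects, and a 0.5 treatment effect
rng = np.random.default_rng(0)
n_units, n_periods = 50, 8
df = pd.DataFrame([(i, t) for i in range(n_units) for t in range(n_periods)],
                  columns=['unit', 'time'])
unit_fe = rng.normal(size=n_units)[df['unit']]
time_fe = rng.normal(size=n_periods)[df['time']]
df['treated'] = ((df['unit'] < 25) & (df['time'] >= 4)).astype(int)
df['outcome'] = unit_fe + time_fe + 0.5 * df['treated'] \
    + rng.normal(scale=0.1, size=len(df))

# (1) LSDV: explicit unit and time dummies
lsdv = smf.ols('outcome ~ treated + C(unit) + C(time)', data=df).fit()

# (2) Within estimator: double-demean, then a no-intercept OLS
for v in ['outcome', 'treated']:
    df[v + '_dm'] = (df[v]
                     - df.groupby('unit')[v].transform('mean')
                     - df.groupby('time')[v].transform('mean')
                     + df[v].mean())
within = smf.ols('outcome_dm ~ treated_dm - 1', data=df).fit()

# Point estimates coincide (balanced panel), both close to the true 0.5
print(lsdv.params['treated'], within.params['treated_dm'])
```

Packages like `reghdfe` and `fixest` exploit this equivalence: they absorb the fixed effects by demeaning rather than estimating thousands of dummies.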

# Python: Two-Way Fixed Effects
import pandas as pd
from linearmodels.panel import PanelOLS

# Set panel index
df = df.set_index(['unit_id', 'year'])

# TWFE regression (in linearmodels, the fixed effects are written into the
# formula itself; from_formula has no entity_effects/time_effects arguments)
model = PanelOLS.from_formula(
    'outcome ~ treated + controls + EntityEffects + TimeEffects',
    data=df
)
results = model.fit(cov_type='clustered', cluster_entity=True)
print(results.summary)
* Stata: Two-Way Fixed Effects

* Set panel structure
xtset unit_id year

* TWFE with xtreg
xtreg outcome treated i.year, fe cluster(unit_id)

* Alternative: reghdfe (faster, more FE options)
* ssc install reghdfe
reghdfe outcome treated, absorb(unit_id year) cluster(unit_id)
# R: Two-Way Fixed Effects with fixest
library(fixest)

# TWFE regression
twfe_model <- feols(outcome ~ treated | unit_id + year,
                    data = df,
                    cluster = ~ unit_id)
summary(twfe_model)

# With covariates
twfe_model <- feols(outcome ~ treated + age + income | unit_id + year,
                    data = df,
                    cluster = ~ unit_id)
Python Output
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                outcome   R-squared:                        0.4523
Estimator:                   PanelOLS   R-squared (Between):              0.3245
No. Observations:                5000   R-squared (Within):               0.4523
Date:                Mon, Jan 27 2026   R-squared (Overall):              0.3891
Time:                        14:35:22   Log-likelihood                   -6234.5
Cov. Estimator:             Clustered
                                        F-statistic:                      156.78
Entities:                         500   P-value                           0.0000
Avg Obs:                       10.000   Distribution:                  F(2,4498)
Min Obs:                       10.000
Max Obs:                       10.000   F-statistic (robust):             89.234
                                        P-value                           0.0000
Time periods:                      10   Distribution:                  F(2,4498)
Avg Obs:                       500.00
Min Obs:                       500.00
Max Obs:                       500.00

                             Parameter Estimates
================================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
treated        0.5234     0.0912     5.7391     0.0000      0.3446      0.7022
controls       0.0345     0.0123     2.8049     0.0051      0.0104      0.0586
================================================================================

F-test for entity effects: 12.34 (p = 0.0000)
F-test for time effects: 8.91 (p = 0.0000)
Stata Output
. xtset unit_id year
       panel variable:  unit_id (strongly balanced)
        time variable:  year, 2010 to 2019
                delta:  1 unit

. reghdfe outcome treated, absorb(unit_id year) cluster(unit_id)
(MWFE estimator converged in 2 iterations)

HDFE Linear regression                            Number of obs   =      5,000
Absorbing 2 HDFE groups                           F(1, 499)       =      32.89
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.4523
                                                  Adj R-squared   =     0.3891
                                                  Within R-sq.    =     0.0234
Number of clusters (unit_id) =        500         Root MSE        =     0.8912

                               (Std. Err. adjusted for 500 clusters in unit_id)
------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     treated |   .5234123   .0912456     5.74   0.000     .3440234    .7028012
       _cons |   3.456789   .0456123    75.79   0.000     3.367123    3.546455
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE |  Categories  -  Loss   Coverage  Dups.|
-------------+---------------------------------------+
     unit_id |         500     500      99.8%       0|
        year |          10      10     100.0%       0|
-----------------------------------------------------+
R Output
OLS estimation, Dep. Var.: outcome
Observations: 5,000
Fixed-effects: unit_id: 500,  year: 10
Standard-errors: Clustered (unit_id)
        Estimate Std. Error t value  Pr(>|t|)
treated  0.52341    0.09125  5.7389 1.67e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.891   Adj. R2: 0.389
              Within R2: 0.023

Wald test for joint significance of FEs:
  Unit FEs: F(499, 4490) = 12.34, p < 0.001
  Time FEs: F(9, 4490) = 8.91, p < 0.001

Event Study Design

Event studies estimate treatment effects for each period relative to treatment, allowing you to visualize pre-trends and dynamic effects.

# Python: Event Study
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from linearmodels.panel import PanelOLS

# Create relative time variable
df['rel_time'] = df['year'] - df['treatment_year']

# Create event-time dummies with formula-safe names, omitting t = -1 as the
# reference period (names like 'rel_-5' break the formula parser, and a
# startswith('rel_') filter would also pick up rel_time itself)
event_times = sorted(t for t in df['rel_time'].unique() if t != -1)
for t in event_times:
    name = f'ev_m{-t}' if t < 0 else f'ev_p{t}'
    df[name] = (df['rel_time'] == t).astype(int)

# Event study regression (fixed effects enter through the formula)
ev_cols = [f'ev_m{-t}' if t < 0 else f'ev_p{t}' for t in event_times]
formula = 'outcome ~ ' + ' + '.join(ev_cols) + ' + EntityEffects + TimeEffects'

df_panel = df.set_index(['unit_id', 'year'])
model = PanelOLS.from_formula(formula, data=df_panel)
results = model.fit(cov_type='clustered', cluster_entity=True)

# Plot event study coefficients
coefs = results.params[ev_cols]
ses = results.std_errors[ev_cols]

plt.figure(figsize=(10, 6))
plt.errorbar(range(len(coefs)), coefs, yerr=1.96*ses, fmt='o', capsize=3)
plt.xticks(range(len(event_times)), event_times)
plt.axhline(y=0, color='r', linestyle='--')
plt.axvline(x=ev_cols.index('ev_p0') - 0.5, color='gray', linestyle=':')
plt.xlabel('Periods Relative to Treatment')
plt.ylabel('Coefficient')
plt.title('Event Study')
plt.show()
* Stata: Event Study

* Create relative time, then shift it to be nonnegative:
* Stata factor variables cannot take negative levels
gen rel_time = year - treatment_year
gen rel_time_s = rel_time + 5        // 0 = t-5, ..., 4 = t-1 (reference), 5 = t0

* Event study with factor variables (base level 4 corresponds to t = -1)
reghdfe outcome ib4.rel_time_s, absorb(unit_id year) cluster(unit_id)

* Using eventdd (event study visualization; accepts negative time directly)
* ssc install eventdd
eventdd outcome, timevar(rel_time) ///
    method(hdfe, absorb(unit_id year)) ///
    cluster(unit_id) graph

* Alternative: coefplot for manual plotting
reghdfe outcome ib4.rel_time_s, absorb(unit_id year) cluster(unit_id)
coefplot, keep(*.rel_time_s) vertical yline(0) xline(4.5)
# R: Event Study with fixest
library(fixest)

# Create relative time
df$rel_time <- df$year - df$treatment_year

# Event study with i() syntax
es_model <- feols(outcome ~ i(rel_time, ref = -1) | unit_id + year,
                  data = df,
                  cluster = ~ unit_id)
summary(es_model)

# Built-in event study plot
iplot(es_model,
      xlab = "Periods Relative to Treatment",
      main = "Event Study")

# Customized plot
iplot(es_model,
      ci.col = "steelblue",
      pt.pch = 19,
      grid = TRUE)
Python Output
Event Study Coefficients (reference: t = -1)
============================================
Period   Coef.   Std.Err.   95% CI
------   -----   --------   ------
t-5     -0.023      0.089   [-0.198, 0.152]
t-4      0.012      0.087   [-0.159, 0.183]
t-3     -0.045      0.091   [-0.223, 0.133]
t-2      0.034      0.088   [-0.138, 0.206]
t-1     [reference period]
t=0      0.312      0.095   [ 0.126, 0.498] **
t+1      0.456      0.098   [ 0.264, 0.648] ***
t+2      0.523      0.102   [ 0.323, 0.723] ***
t+3      0.578      0.108   [ 0.366, 0.790] ***
t+4      0.612      0.115   [ 0.387, 0.837] ***
t+5      0.634      0.121   [ 0.397, 0.871] ***

Pre-trend test (H0: all pre-treatment coefs = 0):
F(4, 4490) = 0.89, p = 0.469
[No evidence of differential pre-trends]

[Event Study Plot displayed showing flat pre-trend, positive post-treatment effects]

Generated figures:

[Figure: "Event Study" plot of coefficients against periods relative to treatment]
Stata Output
. reghdfe outcome ib4.rel_time_s, absorb(unit_id year) cluster(unit_id)

HDFE Linear regression                            Number of obs   =      5,000
Absorbing 2 HDFE groups                           F(10, 499)      =      18.23
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.4891
                                                  Adj R-squared   =     0.4234
Number of clusters (unit_id) =        500         Root MSE        =     0.8456

                               (Std. Err. adjusted for 500 clusters in unit_id)
------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  rel_time_s |
          0  |  -.0234567   .0891234    -0.26   0.793    -.1986234    .1517100
          1  |   .0123456   .0875123     0.14   0.888    -.1596234    .1843146
          2  |  -.0456789   .0912345    -0.50   0.617    -.2250234    .1336656
          3  |   .0345678   .0878912     0.39   0.694    -.1381234    .2072590
          4  |          0  (base)
          5  |   .3123456   .0951234     3.28   0.001     .1254234    .4992678
          6  |   .4567891   .0978123     4.67   0.000     .2646234    .6489548
          7  |   .5234567   .1023456     5.11   0.000     .3224234    .7244900
          8  |   .5789012   .1078912     5.37   0.000     .3670234    .7907790
          9  |   .6123456   .1145678     5.34   0.000     .3873234    .8373678
         10  |   .6345678   .1212345     5.23   0.000     .3964234    .8727122
             |
       _cons |   3.234567   .0567891    56.96   0.000     3.123234    3.345900
------------------------------------------------------------------------------

. testparm i(0/3).rel_time_s
 ( 1)  0.rel_time_s = 0
 ( 2)  1.rel_time_s = 0
 ( 3)  2.rel_time_s = 0
 ( 4)  3.rel_time_s = 0

       F(  4,   499) =    0.89
            Prob > F =    0.4692

(Levels of rel_time_s map back to event time: 0 = t-5, ..., 4 = t-1 (base), 5 = t0, ..., 10 = t+5.)
R Output
OLS estimation, Dep. Var.: outcome
Observations: 5,000
Fixed-effects: unit_id: 500,  year: 10
Standard-errors: Clustered (unit_id)
              Estimate Std. Error t value Pr(>|t|)
rel_time::-5  -0.02346    0.08912 -0.2632   0.7926
rel_time::-4   0.01235    0.08751  0.1411   0.8878
rel_time::-3  -0.04568    0.09123 -0.5007   0.6168
rel_time::-2   0.03457    0.08789  0.3932   0.6943
rel_time::0    0.31235    0.09512  3.2838   0.0011 **
rel_time::1    0.45679    0.09781  4.6700   < 1e-5 ***
rel_time::2    0.52346    0.10235  5.1145   < 1e-6 ***
rel_time::3    0.57890    0.10789  5.3656   < 1e-7 ***
rel_time::4    0.61235    0.11457  5.3448   < 1e-7 ***
rel_time::5    0.63457    0.12123  5.2346   < 1e-6 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.846

Joint test of pre-treatment coefficients:
F(4, 499) = 0.89, p = 0.469

[iplot() displays event study figure with flat pre-trends and positive,
increasing post-treatment effects]

Staggered Adoption: The Problem

When units are treated at different times, standard TWFE can produce biased estimates because it implicitly uses already-treated units as controls. Goodman-Bacon (2021) and related work show that, when treatment effects are heterogeneous, this can distort both the sign and the magnitude of the estimate.

The TWFE Problem with Staggered Treatment

Standard TWFE implicitly compares:

  • Newly-treated vs. never-treated (good)
  • Newly-treated vs. not-yet-treated (good)
  • Newly-treated vs. already-treated (problematic!)

The third comparison can produce negative weights, causing bias when treatment effects are heterogeneous across time or units.
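The bias is easy to demonstrate by simulation. The sketch below uses deliberately extreme, noise-free hypothetical data: two cohorts, no never-treated units, and effects that grow by one unit per period after switch-on. The true average effect on the treated is 3.9, yet TWFE returns a badly attenuated 0.5:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Two cohorts (treated from t=2 and t=8), no never-treated group;
# effect grows by 1 each period after switch-on; no noise, no true FEs
rows = []
for g in (2, 8):                        # treatment start: early = 2, late = 8
    for unit in range(5):
        for t in range(10):
            treated = int(t >= g)
            rows.append({'unit': f'g{g}_{unit}', 'time': t,
                         'treated': treated,
                         'outcome': float((t - g + 1) * treated)})
df = pd.DataFrame(rows)

true_att = df.loc[df['treated'] == 1, 'outcome'].mean()
twfe = smf.ols('outcome ~ treated + C(unit) + C(time)', data=df).fit()

print(f'True ATT:      {true_att:.2f}')                 # 3.90
print(f'TWFE estimate: {twfe.params["treated"]:.2f}')   # 0.50
```

The attenuation comes from the early cohort's long-run effects entering late-cohort comparisons with negative weights, exactly the third comparison above.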

Modern DiD Estimators

Several new estimators address the staggered treatment problem. They differ in assumptions and aggregation methods but all avoid the negative weighting issue.

Callaway-Sant'Anna (2021)

# Python: Callaway-Sant'Anna with csdid
# pip install csdid
from csdid import ATTgt

# Estimate group-time ATTs
att_gt = ATTgt(
    data=df,
    yname='outcome',
    tname='year',
    idname='unit_id',
    gname='first_treat',  # year of first treatment (0 if never)
    control_group='nevertreated'  # or 'notyettreated'
)
results = att_gt.fit()

# Aggregate to overall ATT
agg_overall = results.aggregate('simple')
print(agg_overall)

# Aggregate to event-study
agg_event = results.aggregate('event')
agg_event.plot()
* Stata: Callaway-Sant'Anna
* ssc install csdid

* Basic estimation
csdid outcome, ivar(unit_id) time(year) gvar(first_treat)

* Aggregate to overall ATT
csdid_stats simple

* Event study aggregation
csdid_stats event

* Plot event study
csdid_plot, style(rcap)
# R: Callaway-Sant'Anna with did package
library(did)

# Estimate group-time ATTs
out <- att_gt(
  yname = "outcome",
  tname = "year",
  idname = "unit_id",
  gname = "first_treat",
  data = df,
  control_group = "nevertreated"  # or "notyettreated"
)
summary(out)

# Aggregate to simple ATT
agg_simple <- aggte(out, type = "simple")
summary(agg_simple)

# Event study aggregation
agg_es <- aggte(out, type = "dynamic")
ggdid(agg_es)
Python Output
Callaway-Sant'Anna (2021) Group-Time ATT Estimates
==================================================
Control Group: Never Treated
Anticipation Periods: 0
Estimation Method: Doubly Robust (DR)

Group-Time Average Treatment Effects:
-------------------------------------
Group  | Time |   ATT   | Std.Err  |    [95% CI]
-------|------|---------|----------|----------------
 2014  | 2014 |  0.298  |  0.112   | [0.078, 0.518]
 2014  | 2015 |  0.412  |  0.098   | [0.220, 0.604]
 2014  | 2016 |  0.489  |  0.105   | [0.283, 0.695]
 2015  | 2015 |  0.267  |  0.108   | [0.055, 0.479]
 2015  | 2016 |  0.378  |  0.095   | [0.192, 0.564]
 2016  | 2016 |  0.234  |  0.115   | [0.009, 0.459]

Aggregated Treatment Effects:
-----------------------------
Simple ATT: 0.346 (SE: 0.089)
95% CI: [0.172, 0.520]
p-value: 0.0001

[Event study plot shows cohort-specific treatment effects]

Generated figures:

[Figure: "Callaway-Sant'Anna Event Study" plot of ATT against periods relative to treatment]
Stata Output
. csdid outcome, ivar(unit_id) time(year) gvar(first_treat)

Callaway and Sant'Anna (2021) Doubly Robust DiD Estimator
Control Group: Never Treated
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 g2014 t2014 |   .2978123   .1123456     2.65   0.008     .0776234    .5180012
 g2014 t2015 |   .4123456   .0981234     4.20   0.000     .2200234    .6046678
 g2014 t2016 |   .4891234   .1052345     4.65   0.000     .2828678    .6953790
 g2015 t2015 |   .2671234   .1081234     2.47   0.014     .0552012    .4790456
 g2015 t2016 |   .3781234   .0952345     3.97   0.000     .1914678    .5647790
 g2016 t2016 |   .2341234   .1152345     2.03   0.042     .0082678    .4599790
------------------------------------------------------------------------------

. csdid_stats simple

ATT Aggregation: Simple
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ATT |   .3462345   .0891234     3.88   0.000     .1715567    .5209123
------------------------------------------------------------------------------

. csdid_stats event, window(-5 5)
[Event study table with coefficients for each relative time period]
R Output
Call:
att_gt(yname = "outcome", tname = "year", idname = "unit_id",
       gname = "first_treat", data = df, control_group = "nevertreated")

Group-Time Average Treatment Effects:
 Group Time    ATT     SE [95% Pointwise Conf. Band]
  2014 2014 0.2978 0.1123       [0.0777, 0.5180]
  2014 2015 0.4123 0.0981       [0.2200, 0.6047]
  2014 2016 0.4891 0.1052       [0.2829, 0.6954]
  2015 2015 0.2671 0.1081       [0.0552, 0.4790]
  2015 2016 0.3781 0.0952       [0.1915, 0.5648]
  2016 2016 0.2341 0.1152       [0.0083, 0.4600]
Signif. codes: `*' confidence band does not cover 0

P-value for pre-test of parallel trends assumption: 0.523
Control Group: Never Treated
Anticipation Periods: 0
Estimation Method: Doubly Robust

> summary(agg_simple)
Call:
aggte(MP = out, type = "simple")

Overall ATT:
    ATT     SE         [95% CI]
 0.3462 0.0891 [0.1716, 0.5209] *

[ggdid() displays event study plot with proper aggregation]

de Chaisemartin-D'Haultfoeuille (2020)

* Stata: de Chaisemartin-D'Haultfoeuille
* ssc install did_multiplegt

* Basic estimation
did_multiplegt outcome unit_id year treatment, ///
    robust_dynamic dynamic(5) placebo(3) ///
    cluster(unit_id) breps(100)

* With controls
did_multiplegt outcome unit_id year treatment, ///
    robust_dynamic dynamic(5) placebo(3) ///
    controls(age income) cluster(unit_id)
# R: de Chaisemartin-D'Haultfoeuille with DIDmultiplegt
library(DIDmultiplegt)

# Basic estimation
result <- did_multiplegt(
  df = df,
  Y = "outcome",
  G = "unit_id",
  T = "year",
  D = "treatment",
  dynamic = 5,
  placebo = 3,
  cluster = "unit_id"
)
summary(result)

# Plot results
did_multiplegt_plot(result)
Stata Output
. did_multiplegt outcome unit_id year treatment, ///
>     robust_dynamic dynamic(5) placebo(3) cluster(unit_id) breps(100)

de Chaisemartin and D'Haultfoeuille (2020) Estimator
Robust to heterogeneous treatment effects
Treatment is first switched on at some point during the panel

Effect at time of treatment switch-on
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    Effect_0 |   .3123456   .0912345     3.42   0.001     .1335291    .4911621
------------------------------------------------------------------------------

Dynamic effects
------------------------------------------------------------------------------
    Effect_1 |   .4234567   .0978123     4.33   0.000     .2317478    .6151656
    Effect_2 |   .4891234   .1023456     4.78   0.000     .2885295    .6897173
    Effect_3 |   .5234567   .1089123     4.81   0.000     .3099923    .7369211
    Effect_4 |   .5567891   .1145678     4.86   0.000     .3322403    .7813379
    Effect_5 |   .5789012   .1198234     4.83   0.000     .3440516    .8137508
------------------------------------------------------------------------------

Placebo tests (pre-treatment effects should be zero)
------------------------------------------------------------------------------
   Placebo_1 |  -.0123456   .0856789    -0.14   0.885    -.1802722    .1555810
   Placebo_2 |   .0234567   .0891234     0.26   0.792    -.1512220    .1981354
   Placebo_3 |  -.0345678   .0923456    -0.37   0.708    -.2155618    .1464262
------------------------------------------------------------------------------

Joint test for placebos: chi2(3) = 0.78, p = 0.854
R Output
de Chaisemartin and D'Haultfoeuille (2020) Estimator
====================================================

Instantaneous effect at treatment switch-on:
  Effect_0: 0.3123 (SE: 0.0912)
  95% CI: [0.134, 0.491]
  p-value: 0.0006

Dynamic effects (periods after treatment):
 Effect | Estimate |   SE   |     95% CI
--------|----------|--------|---------------
    1   |  0.4235  | 0.0978 | [0.232, 0.615]
    2   |  0.4891  | 0.1023 | [0.289, 0.690]
    3   |  0.5235  | 0.1089 | [0.310, 0.737]
    4   |  0.5568  | 0.1146 | [0.332, 0.781]
    5   |  0.5789  | 0.1198 | [0.344, 0.814]

Placebo tests (pre-treatment periods):
 Placebo | Estimate |   SE   |      95% CI
---------|----------|--------|-----------------
   -1    | -0.0123  | 0.0857 | [-0.180, 0.156]
   -2    |  0.0235  | 0.0891 | [-0.151, 0.198]
   -3    | -0.0346  | 0.0923 | [-0.216, 0.146]

Joint test for placebos: chi2(3) = 0.78, p = 0.854
[No evidence against parallel trends]

[did_multiplegt_plot() shows event study figure]

Sun-Abraham (2021)

* Stata: Sun-Abraham Interaction-Weighted Estimator
* ssc install eventstudyinteract

* Create cohort variable, never-treated flag, and event-time dummies
* (leads named rel_time_m*, lags rel_time_p*; t = -1 omitted as reference)
gen cohort = first_treat
gen never_treat = (first_treat == 0)
forvalues k = 2/5 {
    gen rel_time_m`k' = (rel_time == -`k')
}
forvalues k = 0/5 {
    gen rel_time_p`k' = (rel_time == `k')
}

* Sun-Abraham estimator
eventstudyinteract outcome rel_time_*, ///
    cohort(cohort) control_cohort(never_treat) ///
    absorb(unit_id year) vce(cluster unit_id)
# R: Sun-Abraham with fixest
library(fixest)

# Create cohort indicator
df$cohort <- df$first_treat

# Sun-Abraham estimator using sunab()
sa_model <- feols(outcome ~ sunab(cohort, year) | unit_id + year,
                  data = df,
                  cluster = ~ unit_id)
summary(sa_model)

# Aggregate and plot
summary(sa_model, agg = "att")
iplot(sa_model)
Stata Output
. eventstudyinteract outcome rel_time_*, ///
>     cohort(cohort) control_cohort(never_treat) ///
>     absorb(unit_id year) vce(cluster unit_id)

Sun and Abraham (2021) Interaction-Weighted Estimator
Control cohort: never_treat

IW estimates by relative time to treatment
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 rel_time_m5 |  -.0178234   .0892345    -0.20   0.842    -.1927199    .1570731
 rel_time_m4 |   .0089123   .0878912     0.10   0.919    -.1633513    .1811759
 rel_time_m3 |  -.0312456   .0912345    -0.34   0.732    -.2100621    .1475709
 rel_time_m2 |   .0234567   .0891234     0.26   0.792    -.1512220    .1981354
 rel_time_m1 |          0  (reference)
 rel_time_p0 |   .2891234   .0912456     3.17   0.002     .1102853    .4679615
 rel_time_p1 |   .4123456   .0956789     4.31   0.000     .2248173    .5998739
 rel_time_p2 |   .4789123   .0998234     4.80   0.000     .2832620    .6745626
 rel_time_p3 |   .5234567   .1045678     5.01   0.000     .3185074    .7284060
 rel_time_p4 |   .5567891   .1098234     5.07   0.000     .3415391    .7720391
 rel_time_p5 |   .5812345   .1145678     5.07   0.000     .3566857    .8057833
------------------------------------------------------------------------------

Average ATT across all post-treatment periods: 0.4736 (SE: 0.0923)
R Output
OLS estimation, Dep. Var.: outcome
Observations: 5,000
Fixed-effects: unit_id: 500,  year: 10
Standard-errors: Clustered (unit_id)
Sun and Abraham (2021) estimation using cohort-specific effects
                 Estimate Std. Error t value Pr(>|t|)
cohort::2014:-5   -0.0178     0.0892 -0.1996   0.8419
cohort::2014:-4    0.0089     0.0879  0.1014   0.9193
cohort::2014:-3   -0.0312     0.0912 -0.3426   0.7320
cohort::2014:-2    0.0235     0.0891  0.2632   0.7925
cohort::2014:0     0.2891     0.0912  3.1696   0.0016 **
cohort::2014:1     0.4123     0.0957  4.3101   < 1e-4 ***
cohort::2014:2     0.4789     0.0998  4.7978   < 1e-5 ***
cohort::2015:-4   -0.0145     0.0901 -0.1609   0.8723
...
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Aggregated ATT (average post-treatment effect):
    ATT     SE         [95% CI]
 0.4736 0.0923 [0.2927, 0.6545] ***

[iplot() shows Sun-Abraham event study with cohort-weighted estimates]

Comparing All Estimators

A powerful robustness check is to plot multiple DiD estimators on the same figure, as in Braghieri, Levy, and Makarin (2022) on social media and mental health. If estimators disagree substantially, this signals potential heterogeneity or specification concerns.

Why Compare Estimators?

Different estimators make different assumptions:

  • TWFE: Assumes homogeneous treatment effects (can be biased with staggered adoption)
  • Callaway-Sant'Anna: Uses never-treated or not-yet-treated as controls
  • Sun-Abraham: Interaction-weighted approach with cohort-specific effects
  • de Chaisemartin-D'Haultfoeuille: Robust to heterogeneous effects and switching treatments
  • Borusyak et al.: Imputation approach using only untreated observations for counterfactuals

When all estimators agree, this strengthens your causal claims. When they diverge, investigate why.

# Python: Multi-Estimator Comparison Plot (Braghieri et al. style)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# After running each estimator, collect the event-study coefficients
# Example structure: dict with estimator -> (rel_time, coef, ci_lower, ci_upper)

results = {
    'TWFE': {'rel_time': rel_times, 'coef': twfe_coefs, 'ci_lo': twfe_ci_lo, 'ci_hi': twfe_ci_hi},
    'Callaway-Sant\'Anna': {'rel_time': rel_times, 'coef': cs_coefs, 'ci_lo': cs_ci_lo, 'ci_hi': cs_ci_hi},
    'Sun-Abraham': {'rel_time': rel_times, 'coef': sa_coefs, 'ci_lo': sa_ci_lo, 'ci_hi': sa_ci_hi},
    'de Chaisemartin-D\'H': {'rel_time': rel_times, 'coef': dcdh_coefs, 'ci_lo': dcdh_ci_lo, 'ci_hi': dcdh_ci_hi},
}

# Create comparison plot
fig, ax = plt.subplots(figsize=(12, 7))

colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
markers = ['o', 's', '^', 'D']
offsets = [-0.15, -0.05, 0.05, 0.15]  # Horizontal jitter

for i, (name, data) in enumerate(results.items()):
    x = np.array(data['rel_time']) + offsets[i]
    y = data['coef']
    yerr = [np.array(y) - np.array(data['ci_lo']),
            np.array(data['ci_hi']) - np.array(y)]

    ax.errorbar(x, y, yerr=yerr, fmt=markers[i], color=colors[i],
                capsize=3, capthick=1.5, label=name, markersize=8, linewidth=1.5)

# Add reference lines
ax.axhline(y=0, color='black', linestyle='-', linewidth=0.8)
ax.axvline(x=-0.5, color='gray', linestyle='--', linewidth=1, alpha=0.7)

# Formatting
ax.set_xlabel('Periods Relative to Treatment', fontsize=12)
ax.set_ylabel('Treatment Effect Estimate', fontsize=12)
ax.set_title('DiD Estimator Comparison', fontsize=14, fontweight='bold')
ax.legend(loc='upper left', framealpha=0.9)
ax.set_xticks(range(min(rel_times), max(rel_times)+1))

plt.tight_layout()
plt.savefig('estimator_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
* Stata: Multi-Estimator Comparison Plot (Braghieri et al. style)
* Run each estimator and store results, then combine in one figure

* 1. Run TWFE event study (shift rel_time: factor variables cannot be negative)
gen rel_time_s = rel_time + 5
reghdfe outcome ib4.rel_time_s, absorb(unit_id year) cluster(unit_id)
* ssc install parmest
parmest, saving(twfe_results, replace)

* 2. Run Callaway-Sant'Anna
csdid outcome, ivar(unit_id) time(year) gvar(first_treat)
csdid_stats event
* Export results manually or use csdid's stored matrices

* 3. Run Sun-Abraham
eventstudyinteract outcome rel_time_*, cohort(cohort) ///
    control_cohort(never_treat) absorb(unit_id year) cluster(unit_id)
matrix sa_coef = e(b_iw)
matrix sa_var = e(V_iw)

* 4. Run de Chaisemartin-D'Haultfoeuille
did_multiplegt outcome unit_id year treatment, ///
    robust_dynamic dynamic(5) placebo(3) cluster(unit_id)

* Combine results into one dataset for plotting
clear
input str30 estimator rel_time coef ci_lo ci_hi
* ... manually enter or append saved results ...
end

* Create the comparison plot
twoway (rcap ci_lo ci_hi rel_time if estimator=="TWFE", ///
           lcolor(navy) lwidth(medium)) ///
       (scatter coef rel_time if estimator=="TWFE", ///
           mcolor(navy) msymbol(O) msize(medium)) ///
       (rcap ci_lo ci_hi rel_time if estimator=="CS", ///
           lcolor(cranberry) lwidth(medium)) ///
       (scatter coef rel_time if estimator=="CS", ///
           mcolor(cranberry) msymbol(S) msize(medium)) ///
       (rcap ci_lo ci_hi rel_time if estimator=="SA", ///
           lcolor(forest_green) lwidth(medium)) ///
       (scatter coef rel_time if estimator=="SA", ///
           mcolor(forest_green) msymbol(T) msize(medium)) ///
       (rcap ci_lo ci_hi rel_time if estimator=="dCDH", ///
           lcolor(dkorange) lwidth(medium)) ///
       (scatter coef rel_time if estimator=="dCDH", ///
           mcolor(dkorange) msymbol(D) msize(medium)), ///
       xline(-0.5, lcolor(gray) lpattern(dash)) ///
       yline(0, lcolor(black)) ///
       legend(order(2 "TWFE" 4 "Callaway-Sant'Anna" ///
                    6 "Sun-Abraham" 8 "de Chaisemartin-D'H") ///
              rows(1) position(6)) ///
       xtitle("Periods Relative to Treatment") ///
       ytitle("Treatment Effect Estimate") ///
       title("DiD Estimator Comparison")

graph export "estimator_comparison.png", replace width(1200)
# R: Multi-Estimator Comparison Plot (Braghieri et al. style)
library(tidyverse)
library(fixest)
library(did)

# Run each estimator and extract event-study coefficients

# 1. TWFE
twfe_model <- feols(outcome ~ i(rel_time, ref = -1) | unit_id + year,
                    data = df, cluster = ~unit_id)

# 2. Callaway-Sant'Anna
cs_out <- att_gt(yname = "outcome", tname = "year", idname = "unit_id",
                 gname = "first_treat", data = df, control_group = "nevertreated")
cs_agg <- aggte(cs_out, type = "dynamic")

# 3. Sun-Abraham
sa_model <- feols(outcome ~ sunab(first_treat, year) | unit_id + year,
                  data = df, cluster = ~unit_id)

# 4. Borusyak et al. (imputation), via the didimputation package;
# returns a broom-style tibble of event-study estimates
library(didimputation)
did_imp_es <- did_imputation(data = df, yname = "outcome",
                             gname = "first_treat", tname = "year",
                             idname = "unit_id",
                             horizon = TRUE, pretrends = -5:-1)

# Extract coefficients into a tidy data frame
extract_coefs <- function(model, name) {
  coefs <- coef(model)
  se <- sqrt(diag(vcov(model)))
  tibble(
    estimator = name,
    rel_time = parse_number(names(coefs)),
    coef = coefs,
    ci_lo = coefs - 1.96 * se,
    ci_hi = coefs + 1.96 * se
  )
}

all_results <- bind_rows(
  extract_coefs(twfe_model, "TWFE"),
  tibble(
    estimator = "Callaway-Sant'Anna",
    rel_time = cs_agg$egt,
    coef = cs_agg$att.egt,
    ci_lo = cs_agg$att.egt - 1.96 * cs_agg$se.egt,
    ci_hi = cs_agg$att.egt + 1.96 * cs_agg$se.egt
  ),
  extract_coefs(sa_model, "Sun-Abraham"),
  tibble(
    estimator = "Borusyak et al.",
    rel_time = as.numeric(did_imp_es$term),
    coef = did_imp_es$estimate,
    ci_lo = did_imp_es$conf.low,
    ci_hi = did_imp_es$conf.high
  )
)

# Create the comparison plot
ggplot(all_results, aes(x = rel_time, y = coef, color = estimator, shape = estimator)) +
  geom_hline(yintercept = 0, color = "black", linewidth = 0.5) +
  geom_vline(xintercept = -0.5, color = "gray", linetype = "dashed") +
  geom_errorbar(aes(ymin = ci_lo, ymax = ci_hi),
                width = 0.2, linewidth = 0.8,
                position = position_dodge(width = 0.3)) +
  geom_point(size = 3, position = position_dodge(width = 0.3)) +
  scale_color_manual(values = c("TWFE" = "#1f77b4",
                                "Callaway-Sant'Anna" = "#ff7f0e",
                                "Sun-Abraham" = "#2ca02c",
                                "de Chaisemartin-D'H" = "#d62728",
                                "Borusyak et al." = "#9467bd")) +
  scale_shape_manual(values = c("TWFE" = 16,
                                "Callaway-Sant'Anna" = 15,
                                "Sun-Abraham" = 17,
                                "de Chaisemartin-D'H" = 18,
                                "Borusyak et al." = 8)) +
  labs(x = "Periods Relative to Treatment",
       y = "Treatment Effect Estimate",
       title = "DiD Estimator Comparison",
       color = "Estimator", shape = "Estimator") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        panel.grid.minor = element_blank())

ggsave("estimator_comparison.png", width = 10, height = 6, dpi = 300)
Python Output
[Figure: DiD Estimator Comparison] A plot showing four different estimators (TWFE, Callaway-Sant'Anna, Sun-Abraham, de Chaisemartin-D'Haultfoeuille) with their point estimates and 95% confidence intervals across relative time periods.

Key observations:
- Pre-treatment coefficients (t < 0): All estimators show coefficients near zero, supporting the parallel trends assumption
- Post-treatment coefficients (t >= 0):
  * All estimators show positive, significant effects
  * TWFE estimates may be slightly attenuated due to negative weights
  * Modern estimators (CS, SA, dCDH) show consistent, similar magnitudes
- The agreement across estimators strengthens the causal interpretation

Saved as: estimator_comparison.png

Generated figures:

[Figure: DiD Estimator Comparison. Treatment effect (y-axis) vs. periods relative to treatment, -5 to 5 (x-axis), for TWFE, Callaway-SA, Sun-Abraham, and dCDH.]
Stata Output
[Figure: DiD Estimator Comparison]

Four estimators plotted with different colors and markers:
- Navy circles: TWFE (traditional two-way fixed effects)
- Cranberry squares: Callaway-Sant'Anna (2021)
- Forest green triangles: Sun-Abraham (2021)
- Orange diamonds: de Chaisemartin-D'Haultfoeuille (2020)

Results:
- Pre-period (t = -5 to -1): All estimators centered around zero with overlapping confidence intervals
- Treatment period (t = 0): Jump in estimates for all methods
- Post-period (t = 1 to 5): Persistent effects, slight divergence between TWFE and robust estimators

Graph saved as: estimator_comparison.png
R Output
[ggplot2 Figure: DiD Estimator Comparison]
Event study plot comparing TWFE, Callaway-Sant'Anna, and Sun-Abraham estimators.

Visual summary:
- Horizontal line at y = 0 for reference
- Vertical dashed line at t = -0.5 marks treatment onset
- Position dodge prevents overlapping confidence intervals

Interpretation:
- Parallel pre-trends: Estimates at t = -5, -4, -3, -2 are statistically indistinguishable from zero across all estimators
- Treatment effect onset: Clear jump at t = 0
- Effect persistence: Treatment effects remain elevated through t = 5
- Estimator agreement: All three methods yield qualitatively similar results, though TWFE is slightly smaller in post-periods (consistent with attenuation bias from staggered adoption)

Saved: estimator_comparison.png (10" x 6", 300 dpi)
Reference
  • Braghieri, L., Levy, R., & Makarin, A. (2022). "Social Media and Mental Health." American Economic Review, 112(11), 3660-3693. doi:10.1257/aer.20211218

Parallel Trends Diagnostics

The parallel trends assumption is untestable, but you can provide supporting evidence by checking pre-treatment trends.

# Python: Parallel Trends Check
import matplotlib.pyplot as plt

# Plot mean outcomes by group over time
fig, ax = plt.subplots(figsize=(10, 6))

for treated, group_df in df.groupby('treated_group'):
    yearly_means = group_df.groupby('year')['outcome'].mean()
    label = 'Treated' if treated == 1 else 'Control'
    ax.plot(yearly_means.index, yearly_means.values, marker='o', label=label)

treatment_year = 2015  # year the policy took effect (matches the Stata/R examples)
ax.axvline(x=treatment_year, color='r', linestyle='--', label='Treatment')
ax.legend()
ax.set_xlabel('Year')
ax.set_ylabel('Mean Outcome')
plt.show()

# Statistical test: pre-treatment coefficients = 0
# (from event study, test joint significance of pre-treatment dummies)
* Stata: Parallel Trends Diagnostics

* Visual check (preserve first: collapse replaces the data in memory)
preserve
collapse (mean) outcome, by(year treated_group)
twoway (connected outcome year if treated_group==1) ///
       (connected outcome year if treated_group==0), ///
       xline(2015, lpattern(dash)) ///
       legend(label(1 "Treated") label(2 "Control"))
restore

* Test pre-treatment coefficients jointly = 0
* (Stata factor variables cannot be negative, so shift event time first)
gen rel_time_shift = rel_time + 5   // rel_time -5..5 becomes 0..10; base -1 -> 4
reghdfe outcome ib(4).rel_time_shift, absorb(unit_id year) cluster(unit_id)
testparm 0.rel_time_shift 1.rel_time_shift 2.rel_time_shift 3.rel_time_shift

* Placebo test: fake treatment timing
gen fake_post = (year >= 2013)  // true treatment in 2015
gen fake_treat_post = treated_group * fake_post
reg outcome treated_group fake_post fake_treat_post if year < 2015
# R: Parallel Trends Diagnostics
library(ggplot2)
library(dplyr)

# Visual check
df %>%
  group_by(year, treated_group) %>%
  summarise(mean_outcome = mean(outcome)) %>%
  ggplot(aes(x = year, y = mean_outcome, color = factor(treated_group))) +
  geom_line() +
  geom_point() +
  geom_vline(xintercept = 2015, linetype = "dashed") +
  labs(color = "Group", y = "Mean Outcome")

# Test pre-treatment coefficients
es_model <- feols(outcome ~ i(rel_time, ref = -1) | unit_id + year,
                  data = df, cluster = ~ unit_id)

# F-test for pre-treatment coefficients = 0
# (fixest::wald() takes a regular expression, not a vector of names)
wald(es_model, "rel_time::-")
Python Output
Parallel Trends Diagnostic
==========================

Mean Outcomes by Group and Year:
        Treated   Control   Difference
2010      3.12      3.08       0.04
2011      3.25      3.21       0.04
2012      3.38      3.33       0.05
2013      3.51      3.47       0.04
2014      3.64      3.59       0.05
-------- TREATMENT (2015) --------
2015      4.21      3.72       0.49
2016      4.35      3.84       0.51
2017      4.48      3.96       0.52
2018      4.61      4.09       0.52
2019      4.73      4.21       0.52

Pre-treatment difference is stable (~0.04-0.05)
Post-treatment gap widens to ~0.50

Joint test of pre-treatment coefficients:
  F(4, 495) = 0.76
  p-value   = 0.552
Conclusion: Cannot reject parallel pre-trends

[Plot shows parallel lines pre-2015, diverging after treatment]

Generated figures:

[Figure: Parallel Trends Check. Mean outcome by year (2010-2018) for Treated and Control groups; vertical line marks treatment in 2015.]
Stata Output
. testparm L(-5/-2).rel_time

 ( 1)  -5.rel_time = 0
 ( 2)  -4.rel_time = 0
 ( 3)  -3.rel_time = 0
 ( 4)  -2.rel_time = 0

       F(  4,   499) =    0.76
            Prob > F =    0.5523

Conclusion: Cannot reject H0 that all pre-treatment coefficients = 0
This supports (but does not prove) parallel trends

. reg outcome treated_group fake_post fake_treat_post if year < 2015

      Source |       SS           df       MS      Number of obs   =     2,500
-------------+----------------------------------   F(3, 2496)      =      0.89
       Model |  2.34567891         3  .781892970   Prob > F        =    0.4456
    Residual |  2189.12345     2,496  .877244171   R-squared       =    0.0011
-------------+----------------------------------   Adj R-squared   =   -0.0001
       Total |  2191.46913     2,499  .876939227   Root MSE        =    .93662

------------------------------------------------------------------------------
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
treated_gr~p |   .0412345   .0456789     0.90   0.367    -.0483456    .1308146
   fake_post |   .1234567   .0478912     2.58   0.010     .0295234    .2173900
fake_treat~t |   .0078912   .0567891     0.14   0.889    -.1034567    .1192391
       _cons |   3.123456   .0345678    90.35   0.000     3.055678    3.191234
------------------------------------------------------------------------------

Placebo DiD estimate: 0.008 (p = 0.889)
[No spurious "effect" in pre-treatment period]
R Output
# Visual inspection shows parallel pre-trends
# (Plot displayed with parallel lines before treatment year)

Wald test for joint significance of pre-treatment coefficients:
  H0: All pre-treatment coefficients = 0
  Tested coefficients: rel_time::-5, rel_time::-4, rel_time::-3, rel_time::-2
  Wald stat: 3.04 on 4 df
  p-value: 0.5510
Conclusion: Cannot reject H0 (parallel trends assumption supported)

Note: This test has limited power. Even with p > 0.05:
- Parallel trends could be violated
- Small violations may be economically meaningful
- Consider sensitivity analysis (Rambachan & Roth, 2023)

Pre-treatment coefficient magnitudes:
- All coefficients < 0.05 in absolute value
- All 95% CIs include zero
- No evidence of differential pre-trends

Goodman-Bacon Decomposition

TWFE DiD with staggered adoption is a weighted average of all possible 2×2 DiD comparisons. Goodman-Bacon (2021) decomposes the TWFE estimate into three types of comparisons: (1) earlier-treated vs. later-treated, (2) later-treated vs. earlier-treated, and (3) treated vs. never-treated. The problematic comparisons are those that use already-treated units as controls. Visualizing the weights vs. estimates by comparison type reveals whether bias is likely.

# Python: No direct package, but can implement manually
# or use R via rpy2. Conceptual implementation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Get unique treatment cohorts
cohorts = df[df['treatment'] == 1].groupby('unit_id')['year'].min().unique()

# For each pair of cohorts, compute 2x2 DiD
# (Simplified — full implementation requires careful timing)
# In practice, use R's bacondecomp or Stata's bacondecomp

# Visualization of results (after computing in R/Stata)
# bacon_df = pd.read_csv('bacon_decomposition.csv')
# plt.scatter(bacon_df['weight'], bacon_df['estimate'],
#             c=bacon_df['type'].map(color_map))
# plt.xlabel('Weight'); plt.ylabel('2x2 DiD Estimate')
# plt.axhline(y=0, linestyle='--', color='gray')
# plt.title('Goodman-Bacon Decomposition')
# plt.show()
* Install: ssc install bacondecomp
* Goodman-Bacon decomposition
bacondecomp outcome treatment, id(unit_id) time(year)

* The command automatically produces:
* - Table with 2x2 estimates, weights, and comparison types
* - Scatter plot of weight vs estimate
library(bacondecomp)
# Goodman-Bacon decomposition of TWFE
bacon_out <- bacon(outcome ~ treatment, data = df,
                   id_var = "unit_id", time_var = "year")
# View decomposition
print(bacon_out)

# Visualization: weight vs estimate by comparison type
library(ggplot2)
ggplot(bacon_out, aes(x = weight, y = estimate,
                       shape = type, color = type)) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Weight", y = "2x2 DiD Estimate",
       title = "Goodman-Bacon Decomposition",
       color = "Comparison Type", shape = "Comparison Type") +
  theme_minimal()

# Weighted average = TWFE coefficient
cat("TWFE estimate:", sum(bacon_out$estimate * bacon_out$weight), "\n")
Python Output
Goodman-Bacon Decomposition (conceptual output)
=================================================
Note: Use R's bacondecomp or Stata's bacondecomp for full results.
Python implementation requires manual computation of all 2x2 DiD pairs.

Treatment cohorts identified: [2012, 2014, 2016]
Number of 2x2 comparisons: 6

[Scatter plot: Weight (x-axis) vs. 2x2 DiD Estimate (y-axis)
 - Blue triangles: Treated vs. Never-Treated (large weights, positive estimates)
 - Red circles: Earlier vs. Later Treated (moderate weights, positive estimates)
 - Orange squares: Later vs. Earlier Treated (small weights, negative estimates)
 - Dashed horizontal line at y = 0]

Generated figures:

[Figure: Goodman-Bacon Decomposition. 2x2 DiD estimate (y-axis) vs. weight (x-axis), by comparison type: Treated vs Never, Earlier vs Later, Later vs Earlier.]
Stata Output
. bacondecomp outcome treatment, id(unit_id) time(year)

Bacon Decomposition
====================
Overall TWFE estimate: 0.4183

----------------------------------------------------------------------
Type                           |  Estimate     Weight     Wt. Avg.
-------------------------------+--------------------------------------
Treated vs Never Treated       |   0.5823      0.6120      0.3564
Earlier Treated vs Later (T)   |   0.3456      0.2340      0.0809
Later Treated vs Earlier (T)   |  -0.1234      0.1540     -0.0190
-------------------------------+--------------------------------------
Total                          |               1.0000      0.4183
----------------------------------------------------------------------

Note: TWFE estimate equals the weighted average of all 2x2 comparisons.
Later-vs-Earlier comparisons produce negative estimates, indicating
already-treated units are poor controls.

[Scatter plot displayed: Weight vs 2x2 DiD Estimate by comparison type]
R Output
                      type  estimate  weight
1  Treated vs Never Treated   0.58235 0.61200
2  Earlier vs Later Treated   0.34561 0.23400
3  Later vs Earlier Treated  -0.12340 0.15400

TWFE estimate: 0.4183

[ggplot scatter: Weight (x) vs 2x2 DiD Estimate (y)
 Points colored by comparison type:
 - "Treated vs Never Treated": blue, largest weights (~0.6), positive estimates
 - "Earlier vs Later Treated": green, moderate weights (~0.2), positive estimates
 - "Later vs Earlier Treated": red, small weights (~0.15), negative estimates
 Dashed horizontal line at y = 0]

Note: The negative estimate for "Later vs Earlier" comparisons reveals
that already-treated units are contaminated controls.
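The decomposition identity is easy to verify by hand. A minimal check using the illustrative numbers reported above (three comparison types, their weights, and their 2x2 estimates):

```python
# Goodman-Bacon identity: the TWFE coefficient equals the weighted
# average of all 2x2 DiD estimates, with weights that sum to one.
# Numbers are the illustrative values from the decomposition output above.
estimates = [0.5823, 0.3456, -0.1234]  # never / earlier-vs-later / later-vs-earlier
weights = [0.6120, 0.2340, 0.1540]

twfe = sum(w * e for w, e in zip(weights, estimates))
print(round(sum(weights), 2))  # weights sum to 1.0
print(round(twfe, 4))          # matches the reported TWFE estimate (~0.418)
```

Note how the negative "later vs earlier" estimate pulls the weighted average below the clean treated-vs-never comparison.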
Reference
  • Goodman-Bacon, A. (2021). "Difference-in-Differences with Variation in Treatment Timing." Journal of Econometrics, 225(2), 254-277.

TWFE Weight Diagnostics

de Chaisemartin & D'Haultfoeuille (2020) show that the TWFE coefficient is a weighted sum of group-by-period treatment effects in which some weights can be negative. When negative weights coincide with heterogeneous treatment effects, TWFE can be severely biased, even taking the wrong sign. The twowayfeweights command computes and visualizes these weights; if many are negative, use a robust estimator instead.

# Python: Use R via rpy2 or implement manually
# The key insight is computing the regression weights
# from the TWFE regression and checking for negatives

# Conceptual check using fixest residualization
import statsmodels.api as sm

# Demean by unit and time (TWFE residualization)
# (pandas has no 'demean' transform; subtract group means explicitly)
df['outcome_dm'] = df['outcome'] - df.groupby('unit_id')['outcome'].transform('mean')
df['outcome_dm'] -= df.groupby('year')['outcome_dm'].transform('mean')
df['treatment_dm'] = df['treatment'] - df.groupby('unit_id')['treatment'].transform('mean')
df['treatment_dm'] -= df.groupby('year')['treatment_dm'].transform('mean')

# The implicit weight on each observation is proportional to treatment_dm
# Negative treatment_dm for treated observations → negative weights
print("Observations with negative implicit weights:",
      (df.loc[df['treatment']==1, 'treatment_dm'] < 0).sum())
* Install: ssc install twowayfeweights
* Compute TWFE weights
twowayfeweights outcome unit_id year treatment, type(feTR)

* The command reports:
* - Number of positive and negative weights
* - Sum of negative weights (measure of potential bias)
* - Under what assumptions TWFE is valid

* With sensitivity analysis
twowayfeweights outcome unit_id year treatment, type(feTR) ///
    summary_measures
library(TwoWayFEWeights)
# Compute TWFE weights
weights_out <- twowayfeweights(df, "outcome", "unit_id", "year", "treatment",
                                type = "feTR")
summary(weights_out)

# Check for negative weights
neg_weights <- weights_out$weights[weights_out$weights < 0]
cat("Number of negative weights:", length(neg_weights), "\n")
cat("Sum of negative weights:", sum(neg_weights), "\n")

# Histogram of weights (hist() colors per bar, not per observation,
# so color bars by whether the bin midpoint is negative)
h <- hist(weights_out$weights, breaks = 30, plot = FALSE)
plot(h, main = "TWFE Weights Distribution", xlab = "Weight",
     col = ifelse(h$mids < 0, "red", "steelblue"))
abline(v = 0, lty = 2, lwd = 2)
Python Output
TWFE Weight Diagnostics (conceptual implementation)
=====================================================
Demeaning outcome and treatment by unit and time fixed effects...

Observations with negative implicit weights: 47

Among 250 treated observations:
  Positive implicit weights: 203 (81.2%)
  Negative implicit weights:  47 (18.8%)

Warning: 18.8% of treated observations receive negative weight.
With heterogeneous treatment effects, TWFE may be biased.
Consider using robust estimators (Callaway-Sant'Anna, Sun-Abraham, etc.)
Stata Output
. twowayfeweights outcome unit_id year treatment, type(feTR)

Under the common trends assumption, beta estimates a weighted sum of 250 ATTs.

Positive weights: 203 weights summing to  1.1834
Negative weights:  47 weights summing to -0.1834
Total weight: 1.0000

Minimal value of beta under common trends: 0.2345
Maximal value of beta under common trends: 0.7891

The TWFE coefficient could be anywhere between [0.23, 0.79]
depending on the pattern of treatment effect heterogeneity.

Warning: Negative weights detected. If treatment effects are
heterogeneous, the TWFE coefficient may not be a consistent
estimate of any meaningful causal parameter.
R Output
Two-Way FE Weights Analysis
============================
Number of group-time cells: 60

Positive weights: 48 (sum =  1.1834)
Negative weights: 12 (sum = -0.1834)

Number of negative weights: 12
Sum of negative weights: -0.1834

[Histogram: TWFE Weights Distribution
 - Blue bars: positive weights (majority, right of zero)
 - Red bars: negative weights (12 cells, left of zero)
 - Dashed vertical line at weight = 0
 - Most weights concentrated between 0 and 0.05
 - Negative weights range from -0.04 to -0.01]

Interpretation: 20% of group-time cells receive negative weight.
TWFE may be biased with heterogeneous effects.
When to Abandon TWFE

If a substantial fraction of weights are negative, do NOT rely on TWFE. Use one of the robust estimators covered in this module instead.
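There is no universal cutoff for "substantial." As a hedged sketch of a screening rule (the thresholds and the `twfe_red_flag` helper are illustrative, not from the literature):

```python
# Hypothetical screening rule: flag TWFE when negative weights are common
# or large in total magnitude. Thresholds here are illustrative only.
def twfe_red_flag(weights, share_thresh=0.10, sum_thresh=0.05):
    neg = [w for w in weights if w < 0]
    share_neg = len(neg) / len(weights)  # fraction of negative weights
    sum_neg = -sum(neg)                  # total negative mass
    return share_neg > share_thresh or sum_neg > sum_thresh

# Example: 12 of 60 cells negative, summing to -0.18 (as in the output above)
weights = [0.02] * 48 + [-0.015] * 12
print(twfe_red_flag(weights))  # True -> prefer a robust estimator
```

In practice, report the negative-weight share and sum alongside whichever estimator you use, rather than relying on any single threshold.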

Borusyak et al. Imputation Estimator

Borusyak, Jaravel & Spiess (2024) propose an imputation approach: (1) estimate the unit FE + time FE model using ONLY untreated observations, (2) impute counterfactual outcomes for treated observations, (3) the treatment effect is the difference between actual and imputed outcomes. This avoids the negative weighting problem entirely.

# Python: No direct package yet
# Conceptual implementation of the imputation approach:
import statsmodels.api as sm
import pandas as pd

# Step 1: Estimate FE model on untreated observations only
untreated = df[df['treatment'] == 0]
# Unit and time dummies (drop one level of each to avoid perfect collinearity)
untreated_dummies = sm.add_constant(
    pd.get_dummies(untreated[['unit_id', 'year']],
                   columns=['unit_id', 'year'],
                   drop_first=True).astype(float))
model_untreated = sm.OLS(untreated['outcome'], untreated_dummies).fit()

# Step 2: Predict counterfactual Y(0) for treated observations
treated = df[df['treatment'] == 1]
treated_dummies = pd.get_dummies(treated[['unit_id', 'year']],
                                 columns=['unit_id', 'year'],
                                 drop_first=True).astype(float)
# Align columns with the estimation sample; the constant must be 1, not 0
treated_dummies = treated_dummies.reindex(columns=untreated_dummies.columns,
                                          fill_value=0.0)
treated_dummies['const'] = 1.0
y0_hat = model_untreated.predict(treated_dummies)

# Step 3: ATT = mean of actual minus imputed outcomes
att = (treated['outcome'].values - y0_hat.values).mean()
print(f"Imputation ATT: {att:.4f}")

# In practice, use R's didimputation or Stata's did_imputation
* Install: ssc install did_imputation
* Borusyak et al. imputation estimator

* Static ATT
did_imputation outcome unit_id year first_treated, allhorizons pretrends(5)

* Event study
did_imputation outcome unit_id year first_treated, ///
    horizons(0/5) pretrends(5)

* The command reports ATT by horizon (event time)
* with standard errors clustered at unit level
library(didimputation)
# Borusyak et al. imputation estimator
# Static ATT
did_imp <- did_imputation(data = df, yname = "outcome",
                           gname = "first_treated",  # cohort indicator
                           tname = "year", idname = "unit_id")
summary(did_imp)

# Event study with pre-trends
did_imp_es <- did_imputation(data = df, yname = "outcome",
                              gname = "first_treated",
                              tname = "year", idname = "unit_id",
                              horizon = TRUE, pretrends = TRUE)
summary(did_imp_es)

# Plot event study (did_imputation returns a tidy data frame of estimates,
# so build the plot directly rather than relying on a plot method)
library(ggplot2)
did_imp_es$rel_time <- as.numeric(did_imp_es$term)
ggplot(did_imp_es, aes(x = rel_time, y = estimate)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point() +
  geom_errorbar(aes(ymin = estimate - 1.96 * std.error,
                    ymax = estimate + 1.96 * std.error), width = 0.2)
Python Output
Borusyak et al. Imputation Estimator (conceptual)
===================================================
Step 1: FE model estimated on 3,750 untreated observations
        Unit FEs: 50 units
        Time FEs: 10 periods
        R-squared (within): 0.891

Step 2: Counterfactual Y(0) imputed for 1,250 treated observations

Step 3: Treatment effect = Actual - Imputed

Imputation ATT: 0.5734

In practice, use R's didimputation or Stata's did_imputation
for proper standard errors and event-study decomposition.
Stata Output
. did_imputation outcome unit_id year first_treated, allhorizons pretrends(5)

did_imputation: Borusyak, Jaravel, and Spiess (2024)
=====================================================

ATT by horizon (event time):
----------------------------------------------------------------------
     Horizon |  Estimate   Std. Err.      t     P>|t|     [95% CI]
-------------+--------------------------------------------------------
          -5 |  -0.0123     0.0456     -0.27    0.787   -.102    .077
          -4 |   0.0089     0.0423      0.21    0.834   -.074    .092
          -3 |  -0.0234     0.0478     -0.49    0.625   -.117    .071
          -2 |   0.0156     0.0412      0.38    0.705   -.065    .097
          -1 |   0.0000      (ref)        .       .        .       .
           0 |   0.4567     0.0789      5.79    0.000    .302    .612
           1 |   0.5234     0.0823      6.36    0.000    .362    .685
           2 |   0.5891     0.0891      6.61    0.000    .414    .764
           3 |   0.6123     0.0934      6.56    0.000    .429    .796
           4 |   0.5789     0.1012      5.72    0.000    .380    .778
           5 |   0.6012     0.1089      5.52    0.000    .388    .815
----------------------------------------------------------------------
Overall ATT: 0.5603 (SE: 0.0678)

Pre-trend test: F(4, 49) = 0.34, p = 0.849
[No evidence against parallel trends]
R Output
Borusyak et al. (2024) Imputation Estimator
=============================================

Static ATT:
  Estimate:   0.5603
  Std. Error: 0.0678
  t-value:    8.264
  p-value:    < 2e-16 ***
  95% CI:     [0.427, 0.694]

Event Study (with pre-trends):
  horizon  estimate  std.error  p.value
       -5   -0.0123     0.0456    0.787
       -4    0.0089     0.0423    0.834
       -3   -0.0234     0.0478    0.625
       -2    0.0156     0.0412    0.705
       -1    0.0000        ref      ref
        0    0.4567     0.0789    0.000
        1    0.5234     0.0823    0.000
        2    0.5891     0.0891    0.000
        3    0.6123     0.0934    0.000
        4    0.5789     0.1012    0.000
        5    0.6012     0.1089    0.000

[Event study plot: pre-treatment coefficients near zero, post-treatment
coefficients positive and increasing, 95% CIs shown as vertical bars,
dashed line at y = 0]
Reference
  • Borusyak, K., Jaravel, X., & Spiess, J. (2024). "Revisiting Event-Study Designs: Robust and Efficient Estimation." Review of Economic Studies, 91(6), 3253-3285.

Fuzzy Difference-in-Differences

Standard (sharp) DiD assumes perfect compliance—all units in the treatment group actually receive treatment. Fuzzy DiD handles imperfect compliance, analogous to fuzzy RDD. The idea: use the group×post interaction as an instrument for actual treatment. This yields a LATE for compliers, just as in IV estimation.

from linearmodels.iv import IV2SLS
from linearmodels.panel import PanelOLS
import pandas as pd

# Set up panel data
df = df.set_index(['unit_id', 'year'])

# Sharp DiD
sharp = PanelOLS(df['outcome'], df[['treatment_actual']],
                  entity_effects=True, time_effects=True).fit(
                  cov_type='clustered', cluster_entity=True)

# Fuzzy DiD with IV: instrument actual take-up with assignment
# (treatment_assigned = group x post). Unit and year fixed effects are
# omitted here for brevity; add them as dummies in exog for the full
# panel specification.
fuzzy = IV2SLS(df['outcome'],
               exog=None,
               endog=df[['treatment_actual']],
               instruments=df[['treatment_assigned']]).fit(
               cov_type='clustered', clusters=df.index.get_level_values(0))
print(fuzzy.summary)
* Sharp DiD
reghdfe outcome treatment_actual, absorb(unit_id year) cluster(unit_id)

* Fuzzy DiD: IV within DiD framework
* treatment_assigned = group x post interaction (intent to treat)
ivreghdfe outcome (treatment_actual = treatment_assigned), ///
    absorb(unit_id year) cluster(unit_id)

* Or equivalently with xtivreg
xtivreg outcome (treatment_actual = treatment_assigned) i.year, ///
    fe vce(cluster unit_id)

* Check first stage
estat firststage
library(fixest)
# Sharp DiD (standard)
sharp <- feols(outcome ~ treatment_actual | unit_id + year, data = df)

# Fuzzy DiD: instrument actual treatment with assignment
# The DiD interaction (group x post) instruments for actual treatment
fuzzy <- feols(outcome ~ 1 | unit_id + year |
               treatment_actual ~ treatment_assigned, data = df)
summary(fuzzy)

# The coefficient on treatment_actual is now the LATE
# (Local Average Treatment Effect for compliers)

# First-stage F-statistic
fitstat(fuzzy, "ivf")  # should be > 10
Python Output
Sharp DiD (PanelOLS):
  treatment_actual: 0.4523 (SE: 0.0891, p < 0.001)

Fuzzy DiD (IV2SLS):
                          IV-2SLS Estimation Summary
================================================================================
Dep. Variable:                outcome   R-squared:                      0.3891
Estimator:                    IV-2SLS   Adj. R-squared:                 0.3845
No. Observations:                5000   F-statistic:                    45.23

                             Parameter Estimates
================================================================================
                   Parameter  Std. Err.   T-stat   P-value   Lower CI   Upper CI
--------------------------------------------------------------------------------
treatment_actual      0.7234     0.1456    4.968     0.000      0.438      1.009
================================================================================

First-stage F-statistic: 34.56 (> 10, instrument is strong)

Interpretation:
  Sharp DiD (ITT):  0.452 - effect of being assigned to treatment
  Fuzzy DiD (LATE): 0.723 - effect for compliers (those who actually took up)
  Compliance rate: ~62.5% (= ITT / LATE)
Stata Output
. reghdfe outcome treatment_actual, absorb(unit_id year) cluster(unit_id)

Sharp DiD:
------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment_~l |   .4523456   .0891234     5.08   0.000     .2726789    .6320123
------------------------------------------------------------------------------

. ivreghdfe outcome (treatment_actual = treatment_assigned), ///
>     absorb(unit_id year) cluster(unit_id)

Fuzzy DiD (IV):
                     (Std. Err. adjusted for 50 clusters in unit_id)
------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment_~l |   .7234567   .1456789     4.97   0.000     .4306234    1.016290
------------------------------------------------------------------------------

Underidentification test (Kleibergen-Paap rk LM statistic):     34.56
                                         Chi-sq(1) P-val =     0.0000
Weak identification test (Kleibergen-Paap rk Wald F statistic): 34.56
Stock-Yogo weak ID test critical values: 10% maximal IV size    16.38

. estat firststage

First-stage regression summary statistics
-------------------------------------------------------
             |            Adjusted    Partial
    Variable |   R-sq.      R-sq.      R-sq.    F(1,49)
-------------+-----------------------------------------
treatment_~l |  0.5234     0.5123     0.4123     34.56
R Output
TSLS estimation - Pair 1
Dep. var.: outcome
Observations: 5,000
Standard-errors: Clustered (unit_id)
                  Estimate Std. Error t value  Pr(>|t|)
treatment_actual   0.72346    0.14568   4.965   1.2e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

IV diagnostics:
  First-stage F-stat (treatment_actual): 34.56 (p < 0.001)
  Wu-Hausman test: 5.67 (p = 0.017)

Comparison:
  Sharp DiD (OLS): 0.452 - Intent-to-Treat (ITT) effect
  Fuzzy DiD (IV):  0.723 - Local Average Treatment Effect (LATE)
  Implied compliance: 62.5%
Interpreting Fuzzy DiD

Fuzzy DiD estimates the LATE for compliers—units whose actual treatment status changes because of the policy. This is the same interpretation as in IV/2SLS. The identifying assumption combines parallel trends (for DiD) with an exclusion restriction (the instrument only affects outcomes through actual treatment).
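Concretely, since LATE = ITT / first-stage, the two estimates pin down the implied compliance rate. A quick check using the illustrative estimates from the outputs above:

```python
# Implied first-stage (compliance) rate: LATE = ITT / first-stage,
# so compliance = ITT / LATE. Numbers are the illustrative estimates above.
itt = 0.452    # sharp DiD: intent-to-treat effect of assignment
late = 0.723   # fuzzy DiD (IV): effect for compliers
compliance = itt / late
print(f"Implied compliance rate: {compliance:.1%}")  # ~62.5%
```

If the implied compliance rate disagrees with the directly estimated first stage, something is wrong with the specification.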

Reference
  • de Chaisemartin, C. & D'Haultfoeuille, X. (2018). "Fuzzy Differences-in-Differences." Review of Economic Studies, 85(2), 999-1028.

HonestDiD Sensitivity Analysis

Standard parallel trends tests (pre-treatment coefficients = 0) have low power (Roth, 2022). HonestDiD (Rambachan & Roth, 2023) instead asks how robust your results are to violations of parallel trends: it computes confidence intervals that remain valid even if parallel trends is violated by up to a magnitude M (for the relative-magnitudes version, post-treatment violations no larger than M times the largest pre-treatment deviation). The key output is a sensitivity plot showing how the CIs widen as the allowed violation M increases.

# Python: No direct package — use R via rpy2
# Conceptual approach:
import numpy as np

# After running event study, extract coefficients and vcov
# betahat = np.array([...])  # event study coefficients
# sigma = np.array([[...]])   # variance-covariance matrix

# The idea: for a given M (maximum violation of parallel trends),
# compute the worst-case bias and adjust confidence intervals
# M = 0: standard CI (assumes exact parallel trends)
# M > 0: wider CI (allows for some violation)

# In practice, use R's HonestDiD package:
# from rpy2.robjects.packages import importr
# honestdid = importr('HonestDiD')
# sensitivity = honestdid.createSensitivityResults_relativeMagnitudes(
#     betahat=betahat, sigma=sigma,
#     numPrePeriods=4, numPostPeriods=5)
print("Use R's HonestDiD package for sensitivity analysis")
print("See: https://github.com/asheshrambachan/HonestDiD")
* Install: ssc install honestdid
* Step 1: Run event study
* (shift event time so factor-variable levels are nonnegative)
gen rel_time_shift = rel_time + 5   // base period -1 becomes 4
reghdfe outcome ib(4).rel_time_shift, absorb(unit_id year) cluster(unit_id)

* Step 2: HonestDiD sensitivity analysis
honestdid, pre(4) post(5) mvec(0 0.5 1 1.5 2)

* The command produces a sensitivity plot showing
* how confidence intervals widen as M increases.
* If CIs still exclude zero at reasonable M values,
* the result is robust to plausible parallel trends violations.
library(HonestDiD)
library(fixest)

# Step 1: Run event study
es <- feols(outcome ~ i(rel_time, ref = c(-1, -Inf)) | unit_id + year,
            data = df, cluster = ~unit_id)

# Step 2: Extract coefficients and variance-covariance matrix
betahat <- coef(es)
sigma <- vcov(es)

# Step 3: Sensitivity analysis — relative magnitudes approach
sensitivity <- createSensitivityResults_relativeMagnitudes(
  betahat = betahat,
  sigma = sigma,
  numPrePeriods = 4,    # number of pre-treatment periods
  numPostPeriods = 5,   # number of post-treatment periods
  Mbarvec = seq(0, 2, by = 0.5)  # grid of M values
)

# Step 4: Plot sensitivity (the plot also needs the original M = 0 CI)
original <- constructOriginalCS(betahat = betahat, sigma = sigma,
                                numPrePeriods = 4, numPostPeriods = 5)
createSensitivityPlot_relativeMagnitudes(sensitivity, original)

# Robust CIs for each Mbar (Mbar = 0 corresponds to exact parallel trends)
print(sensitivity)
Python Output
HonestDiD Sensitivity Analysis
================================
Note: Use R's HonestDiD package for this analysis.
Python implementation not yet available.
See: https://github.com/asheshrambachan/HonestDiD

Conceptual output (from R):
  M = 0.0: CI = [ 0.427, 0.694]  (standard parallel trends)
  M = 0.5: CI = [ 0.312, 0.809]
  M = 1.0: CI = [ 0.198, 0.923]
  M = 1.5: CI = [ 0.083, 1.038]
  M = 2.0: CI = [-0.031, 1.152]

Result is robust to violations up to M = 1.5
(CI excludes zero for M <= 1.5)
Stata Output
. honestdid, pre(4) post(5) mvec(0 0.5 1 1.5 2)

HonestDiD Sensitivity Analysis (Rambachan & Roth, 2023)
=========================================================
Relative Magnitudes Approach:
(Maximum post-treatment violation relative to max pre-trend)

----------------------------------------------------------------------
   Mbar   |   Lower CI    Upper CI    Excludes Zero?
----------+-----------------------------------------------------------
   0.00   |    0.4270      0.6940         Yes
   0.50   |    0.3120      0.8090         Yes
   1.00   |    0.1980      0.9230         Yes
   1.50   |    0.0830      1.0380         Yes
   2.00   |   -0.0310      1.1520         No
----------------------------------------------------------------------

Interpretation: The treatment effect remains statistically significant
for M <= 1.5 (post-treatment violations up to 1.5x the max pre-trend).
At M = 2.0, the CI includes zero.

[Sensitivity plot displayed: M (x-axis) vs. confidence interval bounds]
R Output
HonestDiD Sensitivity Analysis
================================
Relative Magnitudes Approach:

  Mbar  lb.robust  ub.robust
   0.0     0.4270     0.6940
   0.5     0.3120     0.8090
   1.0     0.1980     0.9230
   1.5     0.0830     1.0380
   2.0    -0.0310     1.1520

[Sensitivity Plot:
 - X-axis: Mbar (allowed violation magnitude, 0 to 2)
 - Y-axis: Confidence interval bounds
 - Shaded region between lower and upper bounds
 - Dashed line at y = 0
 - CI excludes zero for Mbar in [0, 1.5]
 - CI includes zero at Mbar = 2.0]

Interpretation:
  Original estimate: 0.560 (M = 0, exact parallel trends)
  Breakdown value: M ≈ 1.85
  (The result is overturned only if post-treatment violations exceed
  1.85x the maximum pre-treatment trend deviation)
Why Pre-Trend Tests Are Not Enough

Roth (2022) shows that standard pre-trend tests have low power—failing to reject parallel trends does NOT mean trends are actually parallel. HonestDiD addresses this by asking how large parallel trends violations need to be before the conclusions change, providing a more informative robustness check.
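A small Monte Carlo sketch illustrates the low-power point. The numbers here are hypothetical, not from Roth (2022): four pre-period coefficients, each estimated with a known SE of 0.05 (treated as independent), and a true linear violation of 0.02 per period. The joint pre-test misses this violation most of the time:

```python
# Monte Carlo: power of a joint pre-trend test against a linear violation.
# All numbers (SE = 0.05, slope = 0.02, 4 pre-periods) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
se, slope, n_pre, n_sim = 0.05, 0.02, 4, 10_000
crit = 9.488  # chi-square 5% critical value with 4 degrees of freedom

rejections = 0
for _ in range(n_sim):
    true_pre = slope * np.arange(-n_pre, 0)      # linear violation exists
    coefs = true_pre + rng.normal(0, se, n_pre)  # noisy coefficient estimates
    wald = np.sum((coefs / se) ** 2)             # Wald statistic, known SEs
    rejections += wald > crit

print(f"Power of the pre-test: {rejections / n_sim:.0%}")
```

In this setup the test rejects well under half the time, so "fails to reject parallel trends" is weak evidence, which is exactly why the HonestDiD sensitivity analysis above is more informative.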

References
  • Rambachan, A. & Roth, J. (2023). "A More Credible Approach to Parallel Trends." Review of Economic Studies, 90(5), 2555-2591.
  • Roth, J. (2022). "Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends." AER: Insights, 4(3), 305-322.
Key Papers on Modern DiD
  • Callaway, B. & Sant'Anna, P. H. C. (2021). "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, 225(2), 200-230.
  • de Chaisemartin, C. & D'Haultfoeuille, X. (2020). "Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects." American Economic Review, 110(9), 2964-2996.
  • Sun, L. & Abraham, S. (2021). "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects." Journal of Econometrics, 225(2), 175-199.
  • Goodman-Bacon, A. (2021). "Difference-in-Differences with Variation in Treatment Timing." Journal of Econometrics, 225(2), 254-277.
  • Roth, J. (2022). "Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends." American Economic Review: Insights, 4(3), 305-322.
  • Borusyak, K., Jaravel, X., & Spiess, J. (2024). "Revisiting Event-Study Designs: Robust and Efficient Estimation." Review of Economic Studies, 91(6), 3253-3285.
  • Rambachan, A. & Roth, J. (2023). "A More Credible Approach to Parallel Trends." Review of Economic Studies, 90(5), 2555-2591.
  • de Chaisemartin, C. & D'Haultfoeuille, X. (2018). "Fuzzy Differences-in-Differences." Review of Economic Studies, 85(2), 999-1028.
  • de Chaisemartin, C. & D'Haultfoeuille, X. (2022). "Two-Way Fixed Effects and Differences-in-Differences with Heterogeneous Treatment Effects: A Survey." Econometrics Journal, 25(3), 493-510.