6A  Matching

~2 hours · PSM, CEM, Nearest Neighbor

The Logic of Matching

The Fundamental Problem

We want to estimate the effect of a treatment D on outcome Y. The problem: treated and untreated units may differ systematically in ways that also affect Y. For example, people who participate in job training programs (treatment) may be more motivated—and motivation affects employment (outcome) regardless of training.

Matching solution: Find untreated units that "look like" treated units on observable characteristics X. If we can find comparison units with identical X values, comparing outcomes gives us a causal effect—provided there are no unmeasured confounders.

Visual Intuition: How Matching Works

Step 1: Raw data (scatter of income vs. age). Treated (red) and control (blue) differ systematically.

Step 2: After matching. Compare only similar units (matched pairs).


Propensity Score Matching

The propensity score is the probability of treatment given observed covariates: e(X) = P(D=1|X). The key insight (Rosenbaum & Rubin, 1983) is that conditioning on the propensity score balances all covariates, reducing many dimensions to one.
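The balancing property can be checked directly on simulated data. The sketch below (data-generating process invented for illustration) shows that stratifying on the estimated propensity score shrinks the raw covariate gap between treated and control units:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20000
age = rng.normal(40, 10, n)
income = rng.normal(50, 15, n)  # in $1,000s

# Treatment depends on both covariates, so the raw groups are imbalanced
logit = -4 + 0.05 * age + 0.03 * income
D = rng.binomial(1, 1 / (1 + np.exp(-logit)))
raw_gap = age[D == 1].mean() - age[D == 0].mean()

# Estimate e(X) and cut the sample into propensity-score quintiles
X = np.column_stack([age, income])
ps = LogisticRegression(max_iter=1000).fit(X, D).predict_proba(X)[:, 1]
strata = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))

# Within strata, the treated/control age gap largely disappears
within = [age[(strata == s) & (D == 1)].mean() - age[(strata == s) & (D == 0)].mean()
          for s in range(5)]
print(f"raw age gap: {raw_gap:.2f}; mean within-stratum gap: {np.mean(within):.2f}")
```

Even with only five strata, most of the imbalance on age vanishes once we condition on the one-dimensional score.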

What to look for in the code

The process has 4 steps: (1) Estimate propensity scores using logistic regression, (2) Match treated to control units with similar scores, (3) Verify balance improved on all covariates, (4) Estimate treatment effect on matched sample. Watch for overlap/common support—if propensity scores for treated and control don't overlap, matching is impossible.

# Python: Propensity Score Matching
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Step 1: Estimate propensity scores
X = df[['age', 'income', 'education']]
y = df['treatment']

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
df['pscore'] = model.predict_proba(X)[:, 1]

# Step 2: Match treated to nearest control
treated = df[df['treatment'] == 1]
control = df[df['treatment'] == 0]

# 1-NN on the propensity score (with replacement: one control can match several treated)
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['pscore']])
distances, indices = nn.kneighbors(treated[['pscore']])

# Step 3: Create matched sample
matched_control = control.iloc[indices.flatten()]
matched_df = pd.concat([treated.reset_index(drop=True),
                        matched_control.reset_index(drop=True)],
                       keys=['treated', 'control'])

# Step 4: Estimate ATT
att = treated['outcome'].mean() - matched_control['outcome'].mean()
print(f"ATT: {att:.3f}")
* Stata: Propensity Score Matching with teffects

* Method 1: teffects psmatch (built-in, recommended)
teffects psmatch (outcome) (treatment age income education), ///
    atet nn(1)

* With caliper (maximum allowed distance)
teffects psmatch (outcome) (treatment age income education), ///
    atet nn(1) caliper(0.1)

* Method 2: psmatch2 (community-contributed, more options)
* ssc install psmatch2
psmatch2 treatment age income education, ///
    outcome(outcome) logit common

* Check balance after matching
pstest age income education, treated(treatment)
# R: Propensity Score Matching with MatchIt
library(MatchIt)
library(cobalt)

# Estimate propensity scores and match
m.out <- matchit(treatment ~ age + income + education,
                 data = df,
                 method = "nearest",
                 distance = "glm",  # logistic regression
                 caliper = 0.1)

# Check match quality
summary(m.out)

# Extract matched data
matched_df <- match.data(m.out)

# Estimate ATT on matched sample
att_model <- lm(outcome ~ treatment, data = matched_df,
                weights = weights)
summary(att_model)
Python Output
Propensity Score Summary:
  Mean (Treated): 0.724
  Mean (Control): 0.312
  Overlap region: [0.08, 0.95]

Matching Results:
  Treated units: 247
  Matched controls: 247
  Avg. match distance: 0.023

ATT: 0.342
Stata Output
Treatment-effects estimation                   Number of obs      =        500
Estimator      : propensity-score matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Treatment model: logit                                        max =          3
------------------------------------------------------------------------------
             |              AI Robust
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ATET         |
   treatment |
   (1 vs 0)  |   .3421587   .0893214     3.83   0.000     .1670919    .5172255
------------------------------------------------------------------------------

psmatch2: Treatment assignment and common support
             | Off support  On support |     Total
-------------+------------------------+----------
   Untreated |           0         253 |       253
     Treated |           5         242 |       247
-------------+------------------------+----------
       Total |           5         495 |       500
R Output
Call:
matchit(formula = treatment ~ age + income + education, data = df,
        method = "nearest", distance = "glm", caliper = 0.1)

Summary of Balance for All Data:
           Means Treated  Means Control  Std. Mean Diff.
age                42.31          38.76            0.312
income          52847.23       45231.89            0.428
education           14.2           12.8            0.387

Summary of Balance for Matched Data:
           Means Treated  Means Control  Std. Mean Diff.
age                41.89          41.67            0.019
income          51234.56       50987.32            0.014
education           13.9           13.8            0.028

Sample Sizes:
           Control  Treated
All            253      247
Matched        238      238
Unmatched       15        9

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    4.2341      0.0672   63.012    <2e-16 ***
treatment      0.3389      0.0891    3.803   0.00016 ***

Interpreting PSM Results

Example output: ATT: 0.342 (SE: 0.089)

This means: among those who received treatment, the average effect was a 0.342-unit increase in the outcome compared to what they would have experienced without treatment. The ATT (Average Treatment Effect on the Treated) is the relevant estimand when evaluating a voluntary program: we want to know whether the program helped those who actually participated.

Key checks: (1) How many treated units were matched? (2) What's the average distance between matched pairs? (3) Did covariate balance improve? If you lost many treated units to trimming, your estimate applies only to the "matchable" population, not all treated units.
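The overlap (common support) check can be made concrete. A minimal sketch, run here on simulated data standing in for the df used above (column names mirror the earlier examples; the data-generating process is invented):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated stand-in for df (invented data-generating process)
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    'age': rng.normal(40, 10, n),
    'income': rng.normal(50000, 15000, n),
    'education': rng.integers(10, 18, n).astype(float),
})
logit = -8 + 0.05 * df['age'] + 0.00005 * df['income'] + 0.15 * df['education']
df['treatment'] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(df[['age', 'income', 'education']], df['treatment'])
df['pscore'] = ps_model.predict_proba(df[['age', 'income', 'education']])[:, 1]

# Common support: the score range where both treated and control units exist
ps_t = df.loc[df['treatment'] == 1, 'pscore']
ps_c = df.loc[df['treatment'] == 0, 'pscore']
lo, hi = max(ps_t.min(), ps_c.min()), min(ps_t.max(), ps_c.max())
off_support = ~df['pscore'].between(lo, hi)
print(f"overlap region: [{lo:.2f}, {hi:.2f}]; off support: {off_support.sum()} of {n}")
```

Units outside the overlap region have no credible counterpart in the other group; trimming them changes the population your estimate describes, which is why reporting how many were dropped matters.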

Coarsened Exact Matching

CEM (Iacus et al., 2012) temporarily coarsens covariates into bins, performs exact matching within bins, then uses the original (non-coarsened) data for analysis. This avoids propensity score model dependence and guarantees a maximum imbalance bound.

When to use CEM vs PSM

Use CEM when you have a clear sense of what covariate values should be "close enough" (e.g., ages 30-35 are comparable). CEM discards more data but provides stronger balance guarantees. Use PSM when covariates are numerous or continuous without natural groupings.

# Python: Coarsened Exact Matching
import pandas as pd
import numpy as np

# Step 1: Coarsen variables into bins
df['age_bin'] = pd.cut(df['age'], bins=[18, 30, 45, 60, 100])
df['income_bin'] = pd.cut(df['income'], bins=5)
df['edu_bin'] = df['education']  # already categorical

# Step 2: Create strata
df['strata'] = df.groupby(['age_bin', 'income_bin', 'edu_bin']).ngroup()

# Step 3: Keep only matched strata (both treated and control)
strata_counts = df.groupby(['strata', 'treatment']).size().unstack(fill_value=0)
matched_strata = strata_counts[(strata_counts[0] > 0) & (strata_counts[1] > 0)].index
df_matched = df[df['strata'].isin(matched_strata)].copy()  # copy to allow adding columns

# Step 4: Calculate CEM weights
strata_weights = df_matched.groupby(['strata', 'treatment']).size().unstack()
df_matched['cem_weight'] = df_matched.apply(
    lambda row: 1 / strata_weights.loc[row['strata'], row['treatment']], axis=1
)

print(f"Matched: {len(df_matched)} of {len(df)} observations")
* Stata: Coarsened Exact Matching
* ssc install cem

* Basic CEM with automatic coarsening
cem age income education, treatment(treatment)

* Custom cutpoints for coarsening
cem age (#5) income (#4) education, treatment(treatment)
* #5 means 5 equal-sized bins

* Estimate treatment effect with CEM weights
reg outcome treatment age income education [pw=cem_weights]

* Check match quality
imb age income education, treatment(treatment)
# R: Coarsened Exact Matching with MatchIt
library(MatchIt)

# CEM matching
m.cem <- matchit(treatment ~ age + income + education,
                 data = df,
                 method = "cem")

# With custom cutpoints
m.cem <- matchit(treatment ~ age + income + education,
                 data = df,
                 method = "cem",
                 cutpoints = list(age = c(30, 45, 60),
                                  income = "q4"))  # quartiles

# Extract matched data with weights
matched_df <- match.data(m.cem)

# Estimate ATT using weights
att_model <- lm(outcome ~ treatment + age + income + education,
                data = matched_df,
                weights = weights)
summary(att_model)
Python Output
CEM Coarsening Summary:
  age: 4 bins (18-30, 30-45, 45-60, 60-100)
  income: 5 equal-width bins
  education: kept as-is (categorical)

Strata Summary:
  Total strata: 48
  Matched strata: 31 (64.6%)
  Unmatched strata: 17

Matched: 412 of 500 observations
  Treated matched: 198 of 247 (80.2%)
  Control matched: 214 of 253 (84.6%)
Stata Output
Coarsened exact matching

Matching Summary:
-----------------------------------------
          |    All   Matched   Unmatched
----------+------------------------------
  Treated |    247       198          49
Untreated |    253       214          39
----------+------------------------------
    Total |    500       412          88
-----------------------------------------

Multivariate L1 distance: .28451

Univariate imbalance:
          |     L1    mean     min     25%     50%     75%     max
----------+--------------------------------------------------------
      age |  .0423   .0124  -.0021   .0089   .0156   .0187   .0234
   income |  .0387   .0098  -.0012   .0067   .0112   .0145   .0189
education |  .0156   .0045  -.0008   .0023   .0051   .0067   .0089
R Output
Call:
matchit(formula = treatment ~ age + income + education, data = df,
        method = "cem", cutpoints = list(age = c(30, 45, 60), income = "q4"))

Summary of Balance for Matched Data:
           Means Treated  Means Control  Std. Mean Diff.
age                40.23          40.18            0.004
income          48234.12       48012.89            0.012
education          13.45          13.42            0.008

Sample Sizes:
           Control  Treated
All            253      247
Matched        214      198
Unmatched       39       49

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    3.8912      0.1234   31.534    <2e-16 ***
treatment      0.3156      0.0823    3.836   0.00014 ***
age            0.0123      0.0034    3.618   0.00032 ***
income         0.0000      0.0000    2.145   0.03245 *
education      0.0456      0.0189    2.413   0.01612 *
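The Python CEM sketch above stops after constructing the weights. The ATT can then be estimated with weighted group means. The toy numbers below are invented to keep the example self-contained (in practice, use the df_matched built above); note that the simple 1/count weights from the sketch give each matched stratum equal influence, whereas canned implementations (Stata's cem, MatchIt) use slightly different ATT weights that weight control strata by their treated counts.

```python
import numpy as np
import pandas as pd

# Toy matched sample: two strata, with 1/count weights per stratum-treatment cell
df_matched = pd.DataFrame({
    'treatment':  [1,   1,   0,   0,   1,   0],
    'outcome':    [5.0, 4.8, 4.4, 4.6, 5.2, 4.9],
    'cem_weight': [0.5, 0.5, 0.5, 0.5, 1.0, 1.0],
})

t = df_matched[df_matched['treatment'] == 1]
c = df_matched[df_matched['treatment'] == 0]
att = (np.average(t['outcome'], weights=t['cem_weight'])
       - np.average(c['outcome'], weights=c['cem_weight']))
print(f"weighted ATT: {att:.3f}")  # → weighted ATT: 0.350
```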

Nearest Neighbor Matching

Instead of matching on the propensity score, you can match directly on the covariates, using either Mahalanobis distance (which accounts for the scale and correlation of the covariates) or plain Euclidean distance.

# Python: Mahalanobis Distance Matching
from scipy.spatial.distance import cdist
import numpy as np

covariates = ['age', 'income', 'education']
X_treated = df.loc[df['treatment'] == 1, covariates].values
X_control = df.loc[df['treatment'] == 0, covariates].values

# Calculate Mahalanobis distance matrix
cov_matrix = np.cov(df[covariates].values.T)
VI = np.linalg.inv(cov_matrix)
distances = cdist(X_treated, X_control, metric='mahalanobis', VI=VI)

# Find nearest neighbor for each treated
matches = np.argmin(distances, axis=1)

# Extract matched control outcomes
control_indices = df[df['treatment'] == 0].index
matched_outcomes = df.loc[control_indices[matches], 'outcome'].values

# ATT
att = df.loc[df['treatment'] == 1, 'outcome'].mean() - matched_outcomes.mean()
* Stata: Mahalanobis Distance Matching

* Using teffects nnmatch
teffects nnmatch (outcome age income education) (treatment), ///
    atet metric(maha) nn(1)

* With bias adjustment (Abadie-Imbens)
teffects nnmatch (outcome age income education) (treatment), ///
    atet metric(maha) nn(3) biasadj(age income education)

* Multiple matches with replacement
teffects nnmatch (outcome age income education) (treatment), ///
    atet metric(maha) nn(5)
# R: Mahalanobis Distance Matching
library(MatchIt)

# Mahalanobis distance matching
m.maha <- matchit(treatment ~ age + income + education,
                  data = df,
                  method = "nearest",
                  distance = "mahalanobis")

# Multiple matches (k:1)
m.maha <- matchit(treatment ~ age + income + education,
                  data = df,
                  method = "nearest",
                  distance = "mahalanobis",
                  ratio = 3)  # 3 controls per treated

# With caliper on propensity score
m.maha <- matchit(treatment ~ age + income + education,
                  data = df,
                  method = "nearest",
                  distance = "mahalanobis",
                  mahvars = ~ age + income + education,
                  caliper = 0.25,
                  std.caliper = TRUE)
Python Output
Covariance Matrix:
                age       income  education
age         156.234      2341.56      12.45
income      2341.56  89234567.12    1234.56
education     12.45      1234.56       4.89

Mahalanobis Distance Summary:
  Min distance: 0.087
  Mean distance: 0.456
  Median distance: 0.389
  Max distance: 2.341

Matching Results:
  Treated units: 247
  Matched controls: 247 (with replacement: 189 unique)

ATT (Mahalanobis): 0.328
Stata Output
Treatment-effects estimation                   Number of obs      =        500
Estimator      : nearest-neighbor matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Distance metric: Mahalanobis                                  max =          4
------------------------------------------------------------------------------
             |              AI Robust
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ATET         |
   treatment |
   (1 vs 0)  |   .3276453   .0912345     3.59   0.000     .1488289    .5064617
------------------------------------------------------------------------------

Matching with bias adjustment (nn=3):
------------------------------------------------------------------------------
             |              AI Robust
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ATET         |
   treatment |
   (1 vs 0)  |   .3312187   .0823456     4.02   0.000     .1698241    .4926133
------------------------------------------------------------------------------
R Output
Call:
matchit(formula = treatment ~ age + income + education, data = df,
        method = "nearest", distance = "mahalanobis", ratio = 3)

Summary of Balance for All Data:
           Means Treated  Means Control  Std. Mean Diff.
age                42.31          38.76            0.312
income          52847.23       45231.89            0.428
education           14.2           12.8            0.387

Summary of Balance for Matched Data:
           Means Treated  Means Control  Std. Mean Diff.
age                42.31          42.08            0.020
income          52847.23       52134.56            0.040
education           14.2           14.1            0.028

Sample Sizes:
                 Control  Treated
All                  253      247
Matched (Ratio)      741      247
Unmatched              0        0

Effective Sample Sizes:
           Control  Treated
All         253.00      247
Matched     698.23      247

Balance Diagnostics

Always check that matching improved covariate balance. Report standardized mean differences (SMD) and variance ratios.

# Python: Balance Diagnostics
import matplotlib.pyplot as plt

def calc_smd(df, var, treatment_col):
    """Calculate standardized mean difference."""
    treated = df[df[treatment_col] == 1][var]
    control = df[df[treatment_col] == 0][var]
    pooled_std = np.sqrt((treated.var() + control.var()) / 2)
    return (treated.mean() - control.mean()) / pooled_std

# Before and after matching
covariates = ['age', 'income', 'education']
smd_before = [calc_smd(df, v, 'treatment') for v in covariates]
smd_after = [calc_smd(matched_df, v, 'treatment') for v in covariates]

# Love plot
fig, ax = plt.subplots(figsize=(8, 5))
y_pos = np.arange(len(covariates))
ax.scatter(smd_before, y_pos, marker='o', label='Before', s=100)
ax.scatter(smd_after, y_pos, marker='s', label='After', s=100)
ax.axvline(x=0.1, color='r', linestyle='--', alpha=0.5)
ax.axvline(x=-0.1, color='r', linestyle='--', alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(covariates)
ax.set_xlabel('Standardized Mean Difference')
ax.legend()
plt.tight_layout()
plt.show()
* Stata: Balance Diagnostics

* After psmatch2
pstest age income education, treated(treatment) both graph

* Standardized differences table
tebalance summarize

* Density plots of propensity scores
tebalance density

* Overidentification test for balance
tebalance overid
# R: Balance Diagnostics with cobalt
library(cobalt)

# Balance table
bal.tab(m.out, un = TRUE, thresholds = c(m = 0.1))

# Love plot (visual balance check)
love.plot(m.out,
          binary = "std",
          thresholds = c(m = 0.1),
          var.order = "unadjusted")

# Density plots
bal.plot(m.out, var.name = "age", which = "both")

# Full balance summary
summary(m.out)
Python Output
Balance Diagnostics
==================

Standardized Mean Differences:
              Before     After    Threshold
Variable    Matching  Matching        (0.1)
-----------  --------  --------  ---------
age             0.312     0.019       PASS
income          0.428     0.014       PASS
education       0.387     0.028       PASS

Variance Ratios:
              Before     After  Acceptable
Variable    Matching  Matching   (0.5-2.0)
-----------  --------  --------  ---------
age             1.342     1.023       PASS
income          1.567     1.089       PASS
education       1.234     0.987       PASS

[Love plot displayed showing improvement in balance]
Stata Output
. tebalance summarize

Covariate balance summary
                              Raw    Matched
-----------------------------------------------
Number of obs       =         500        476
Treated obs         =         247        238
Control obs         =         253        238
-----------------------------------------------

             Standardized differences      Variance ratio
             --------------------------  --------------------------
                    Raw     Matched           Raw     Matched
             ----------  ----------     ----------  ----------
age               0.312       0.019          1.342       1.023
income            0.428       0.014          1.567       1.089
education         0.387       0.028          1.234       0.987

. tebalance overid

Overidentification test for covariate balance
    chi2(3)     =   2.14
    Prob > chi2 =   0.5436

Balance is adequate (fail to reject null of balance)
R Output
Call:
bal.tab(m.out, un = TRUE, thresholds = c(m = 0.1))

Balance Measures
              Type  Diff.Un  Diff.Adj  M.Threshold
age        Contin.    0.312     0.019     Balanced
income     Contin.    0.428     0.014     Balanced
education  Contin.    0.387     0.028     Balanced

Balance tally for mean differences
                     count
Balanced, <0.1           3
Not Balanced, >0.1       0

Variable with most imbalance
           Variable  Diff.Adj  M.Threshold
          education     0.028     Balanced

Sample sizes
           Control  Treated
All            253      247
Matched        238      238
Unmatched       15        9

[Love plot displayed with all variables within 0.1 threshold]

Interpreting Balance Diagnostics

Standardized Mean Difference (SMD): The difference in means between treated and control, divided by the pooled standard deviation. Rule of thumb: SMD < 0.1 indicates good balance; SMD < 0.25 is acceptable.

Variance Ratio: Ratio of variances between groups. Should be close to 1.0 (range 0.5-2.0 is acceptable).

What to report: Always show a balance table or Love plot comparing before/after matching. If balance is poor on key covariates, your matching has failed and you need to: (1) try different matching specifications, (2) include additional covariates, or (3) acknowledge the limitation.
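The variance ratio is a one-line complement to the calc_smd function shown earlier. A minimal sketch on simulated groups (invented numbers for illustration):

```python
import numpy as np

def variance_ratio(treated, control):
    """Treated-to-control variance ratio; values far from 1.0 flag imbalance."""
    return np.var(treated, ddof=1) / np.var(control, ddof=1)

rng = np.random.default_rng(1)
balanced_t = rng.normal(40, 10, 200)
balanced_c = rng.normal(40, 10, 200)    # same spread as treated
imbalanced_c = rng.normal(40, 25, 200)  # much wider spread

print(f"balanced:   {variance_ratio(balanced_t, balanced_c):.2f}")
print(f"imbalanced: {variance_ratio(balanced_t, imbalanced_c):.2f}")
```

Note that SMD and the variance ratio are complementary: groups can have identical means (SMD near 0) yet very different spreads, which only the variance ratio catches.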

Doubly Robust Estimation

Doubly robust methods (Robins et al., 1994) combine propensity score weighting with outcome regression. The estimator is consistent if either the propensity score model OR the outcome model is correctly specified—you get "two chances" to get it right.

Why doubly robust?

Standard matching relies entirely on the propensity score model. If it's misspecified, your estimate is biased. Doubly robust methods add an outcome model as insurance. With modern machine learning (SuperLearner, random forests), you can flexibly estimate both models without imposing restrictive functional forms.

# Python: Doubly Robust Estimation (AIPW)
from sklearn.linear_model import LogisticRegression, LinearRegression
import numpy as np

covariates = ['age', 'income', 'education']
X = df[covariates].values
D = df['treatment'].values
Y = df['outcome'].values

# Step 1: Estimate propensity scores
ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(X, D)
pscore = ps_model.predict_proba(X)[:, 1]

# Step 2: Estimate outcome models
mu1_model = LinearRegression()
mu0_model = LinearRegression()
mu1_model.fit(X[D == 1], Y[D == 1])
mu0_model.fit(X[D == 0], Y[D == 0])

mu1 = mu1_model.predict(X)  # E[Y(1)|X]
mu0 = mu0_model.predict(X)  # E[Y(0)|X]

# Step 3: AIPW estimator
aipw_1 = mu1 + D * (Y - mu1) / pscore
aipw_0 = mu0 + (1 - D) * (Y - mu0) / (1 - pscore)
ate_aipw = np.mean(aipw_1 - aipw_0)

print(f"ATE (AIPW): {ate_aipw:.3f}")
* Stata: Doubly Robust with teffects

* AIPW (Augmented IPW)
teffects aipw (outcome age income education) ///
    (treatment age income education), atet

* IPWRA (IPW + Regression Adjustment)
teffects ipwra (outcome age income education) ///
    (treatment age income education), atet

* With different link functions
teffects aipw (outcome age income education, poisson) ///
    (treatment age income education, probit), atet
# R: Doubly Robust with AIPW package
library(AIPW)

# Basic AIPW
aipw_obj <- AIPW$new(
  Y = df$outcome,
  A = df$treatment,
  W = df[, c("age", "income", "education")],
  Q.SL.library = "SL.glm",   # outcome model
  g.SL.library = "SL.glm"    # propensity model
)

# Fit and summarize
aipw_obj$fit()
aipw_obj$summary()

# With SuperLearner (ML ensemble)
library(SuperLearner)
aipw_ml <- AIPW$new(
  Y = df$outcome,
  A = df$treatment,
  W = df[, c("age", "income", "education")],
  Q.SL.library = c("SL.glm", "SL.ranger", "SL.xgboost"),
  g.SL.library = c("SL.glm", "SL.ranger")
)
aipw_ml$fit()
aipw_ml$summary()
Python Output
Propensity Score Model:
  Logistic Regression, AUC: 0.78

Outcome Model (Treated):
  R-squared: 0.342
  Coefficients: age=0.012, income=0.00001, education=0.089

Outcome Model (Control):
  R-squared: 0.318
  Coefficients: age=0.010, income=0.00001, education=0.078

AIPW Estimates:
  E[Y(1)]: 4.587 (SE: 0.067)
  E[Y(0)]: 4.243 (SE: 0.058)
  ATE (AIPW): 0.344
  Standard Error: 0.089
  95% CI: [0.170, 0.518]
  p-value: 0.0001
Stata Output
Treatment-effects estimation                   Number of obs      =        500
Estimator      : augmented IPW
Outcome model  : linear
Treatment model: logit
------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ATET         |
   treatment |
   (1 vs 0)  |   .3445672   .0879234     3.92   0.000     .1722404    .5168940
-------------+----------------------------------------------------------------
POmean       |
   treatment |
          0  |   4.243128   .0581234    73.01   0.000     4.129209    4.357047
------------------------------------------------------------------------------
Note: Potential outcome means are computed at covariate means in the
estimation sample.

teffects ipwra:
------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ATET         |
   treatment |
   (1 vs 0)  |   .3412345   .0867123     3.94   0.000     .1712822    .5111868
------------------------------------------------------------------------------
R Output
                  Estimate  Std.Error   95%LCL   95%UCL
Risk of Y(1)        4.5867     0.0672   4.4550   4.7184
Risk of Y(0)        4.2431     0.0584   4.1286   4.3576
Risk Difference     0.3436     0.0889   0.1694   0.5178
Risk Ratio          1.0810     0.0214   1.0390   1.1246

ATE: 0.3436 (SE: 0.0889)
95% Confidence Interval: [0.169, 0.518]

With SuperLearner ensemble:
                  Estimate  Std.Error   95%LCL   95%UCL
Risk Difference     0.3389     0.0856   0.1711   0.5067

Cross-validated risks:
  Outcome model: SL.ranger selected (CV-risk: 0.234)
  Propensity model: SL.glm selected (CV-risk: 0.312)
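The Python AIPW sketch above reports only a point estimate. Because the AIPW estimator is the sample mean of per-unit influence values (the aipw_1 - aipw_0 vector computed there), a standard error follows directly. The toy values below stand in for that vector; in practice, plug in the vector from the block above:

```python
import numpy as np

# Stand-in for the per-unit values aipw_1 - aipw_0 (invented for illustration)
rng = np.random.default_rng(7)
psi = rng.normal(0.34, 1.0, 500)

ate = psi.mean()
se = psi.std(ddof=1) / np.sqrt(len(psi))
lo, hi = ate - 1.96 * se, ate + 1.96 * se
print(f"ATE: {ate:.3f}  SE: {se:.3f}  95% CI: [{lo:.3f}, {hi:.3f}]")
```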
Key Assumption: Selection on Observables

All matching methods assume no unmeasured confounders—that once you condition on observed covariates, treatment assignment is as good as random. This is untestable. Use sensitivity analysis (e.g., Rosenbaum bounds) to assess robustness to hidden bias.
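A minimal sketch of Rosenbaum bounds for matched-pair data, using the Wilcoxon signed-rank version: at a hypothesized hidden-bias level Γ, the worst-case probability that a pair's difference is positive is Γ/(1+Γ), which yields an upper bound on the one-sided p-value via a normal approximation. The pair differences below are simulated for illustration; ties in |differences| are ignored for simplicity.

```python
import numpy as np
from scipy.stats import norm

def rosenbaum_upper_p(diffs, gamma):
    """Upper bound on the one-sided p-value of the Wilcoxon signed-rank test
    on matched-pair differences, allowing hidden bias up to gamma."""
    d = np.asarray(diffs, dtype=float)
    d = d[d != 0]                                   # drop zero differences
    ranks = np.argsort(np.argsort(np.abs(d))) + 1   # ranks of |d|, 1-based
    T = ranks[d > 0].sum()                          # signed-rank statistic
    p_plus = gamma / (1 + gamma)                    # worst-case P(pair positive)
    mu = p_plus * ranks.sum()
    sigma = np.sqrt(p_plus * (1 - p_plus) * (ranks ** 2).sum())
    return 1 - norm.cdf((T - mu) / sigma)

rng = np.random.default_rng(0)
diffs = rng.normal(0.34, 1.0, 247)  # toy treated-minus-control pair differences
for g in (1.0, 1.5, 2.0):
    print(f"Gamma = {g}: upper bound on p-value = {rosenbaum_upper_p(diffs, g):.4f}")
```

Γ = 1 reproduces the ordinary test (no hidden bias); the largest Γ at which the bound stays below 0.05 summarizes how much unmeasured confounding the result can tolerate.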

References
  • Rosenbaum, P. & Rubin, D. (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika.
  • Robins, J., Rotnitzky, A., & Zhao, L.P. (1994). "Estimation of Regression Coefficients When Some Regressors Are Not Always Observed." Journal of the American Statistical Association.
  • Abadie, A. & Imbens, G. (2006). "Large Sample Properties of Matching Estimators for Average Treatment Effects." Econometrica.
  • Ho, D., Imai, K., King, G., & Stuart, E. (2007). "Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference." Political Analysis.
  • Iacus, S., King, G., & Porro, G. (2012). "Causal Inference without Balance Checking: Coarsened Exact Matching." Political Analysis.