11D  Causal ML

~3 hours · DML, Causal Forests, Heterogeneous Treatment Effects

Learning Objectives

  • Understand why standard ML methods fail for causal inference and how to fix them
  • Implement Double/Debiased Machine Learning (DML) to estimate treatment effects with high-dimensional controls
  • Use Causal Forests to discover heterogeneous treatment effects across subpopulations
  • Apply these methods in Python, Stata, and R using purpose-built packages
  • Know when causal ML adds value over traditional econometric approaches

Why Combine ML with Causal Inference?

Throughout this course, you have learned two distinct sets of tools. In Module 6, you studied causal inference methods — difference-in-differences, instrumental variables, regression discontinuity — all designed to answer the question "What is the effect of X on Y?" In Module 11, you have been learning machine learning methods — LASSO, random forests, neural networks — designed to answer the question "How can I best predict Y from X?" These two goals are fundamentally different, and conflating them is one of the most common mistakes in applied research.

But there is a productive middle ground. Machine learning tools can be used in service of causal inference, not to replace it. The key insight is that many causal inference problems contain prediction subproblems. For example, if you want to estimate the effect of a job training program on earnings, controlling for a large set of demographic and labor market variables, you need to predict both the outcome (earnings) and the treatment (program participation) as functions of those controls. ML excels at this kind of flexible prediction. The challenge is using ML predictions as inputs to causal estimation without introducing bias.

This module covers two major frameworks that solve this problem rigorously: Double/Debiased Machine Learning (DML), which estimates average treatment effects while using ML to control for high-dimensional confounders, and Causal Forests, which estimate heterogeneous treatment effects — how the effect varies across different subpopulations.

Two Worlds, One Framework

[Diagram] Machine Learning (flexible prediction, high-dimensional data, regularization, cross-validation) + Causal Inference (identification strategy, unbiased estimation, valid inference, interpretable effects) = Causal ML: use ML for nuisance estimation (the prediction subproblems) while maintaining valid causal identification and inference. DML • Causal Forests • Heterogeneous Effects.

The intellectual foundation for this approach was laid by Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018) in their landmark paper on Double/Debiased Machine Learning, and by Athey and Imbens (2019) in their survey of machine learning methods for causal inference in economics. The key principle is: let ML handle the prediction, but use carefully designed statistical procedures to ensure that the final causal estimate remains unbiased and has valid confidence intervals.

Double/Debiased Machine Learning (DML)

The Problem: Regularization Bias

Suppose you want to estimate the causal effect of a treatment D on an outcome Y, controlling for a potentially large set of confounders X. The standard approach in economics is to run a linear regression of Y on D and X. But what if X contains dozens or hundreds of variables — demographic characteristics, geographic fixed effects, lagged outcomes, interactions — and you are not sure which ones matter? You might think: "I will just use LASSO to select the relevant controls and then estimate the treatment effect." This seems sensible, but it leads to a subtle and dangerous bias.

The problem is regularization bias. LASSO works by shrinking coefficients toward zero to prevent overfitting. But this shrinkage is indiscriminate — it shrinks the coefficients on the controls and distorts the relationship between D and Y. Even if LASSO correctly identifies which controls matter, the regularization penalty biases the treatment effect estimate. Worse, the standard errors from a penalized regression are not valid for inference, so you cannot construct reliable confidence intervals.

To see the intuition, consider a concrete example. Suppose the true model is Y = θD + Xβ + ε, and D is correlated with X (as it almost always is in observational data). When LASSO penalizes the coefficients on X, it does not fully partial out the variation in D that is explained by X. The remaining confounding variation leaks into the estimate of θ, biasing it. This is not a small-sample problem — it persists even with very large datasets because the bias comes from the regularization itself, not from estimation noise.

The Naive Approach Fails

Simply plugging LASSO (or any ML method) into a regression does not give you a valid causal estimate. The two main problems are:

  • Regularization bias: Penalized estimators shrink coefficients, which distorts the treatment effect estimate even in large samples.
  • Invalid inference: Standard errors from penalized regressions do not account for the model selection step, so confidence intervals and p-values are unreliable.

The Solution: Neyman Orthogonality + Cross-Fitting

Double/Debiased Machine Learning (DML) solves both problems with two elegant ideas. The first is Neyman orthogonality (also called "double residualization" or "partialling out"), and the second is cross-fitting (a form of sample splitting).

The Neyman orthogonality idea is essentially the Frisch-Waugh-Lovell theorem generalized to allow machine learning. Instead of putting D and X into a single regression, DML proceeds in three steps:

  1. Residualize Y: Use any ML method (LASSO, random forest, neural network) to fit a model ℓ̂(X) for E[Y|X]. Compute the residuals: Ỹ = Y - ℓ̂(X). These residuals are the part of Y that cannot be predicted by the controls.
  2. Residualize D: Use any ML method to fit a second model m̂(X) for E[D|X]. Compute the residuals: D̃ = D - m̂(X). These residuals are the part of D that cannot be predicted by the controls — the "exogenous" variation in treatment.
  3. Estimate θ: Regress Ỹ on D̃ using OLS. The coefficient is your treatment effect estimate, and standard OLS inference is valid.

Why does this work? By residualizing both Y and D, we remove the confounding influence of X from both sides. The resulting estimating equation is Neyman orthogonal: the bias in the estimate of θ depends on the product of the errors from the two ML models, not on either error alone, so even moderately accurate nuisance models deliver a high-quality estimate of θ. This product structure is the "double" in Double Machine Learning, and it is what makes the procedure robust to the slow convergence of flexible ML estimators.

The second ingredient is cross-fitting, which eliminates overfitting bias. If you use the same observations to train the ML model and to compute residuals, the residuals will be artificially small for the training observations (the model has seen them before), which biases the treatment effect estimate. Cross-fitting avoids this by splitting the data into K folds and using an out-of-fold prediction scheme, analogous to cross-validation:

Cross-Fitting with K = 5 Folds

For each fold, train the ML models on the other 4 folds and predict residuals for the held-out fold.

Fold 1:   Predict   Train     Train     Train     Train
Fold 2:   Train     Predict   Train     Train     Train
Fold 3:   Train     Train     Predict   Train     Train
Fold 4:   Train     Train     Train     Predict   Train
Fold 5:   Train     Train     Train     Train     Predict

Train = used to train the ML models; Predict = held out for residual prediction.

Each observation's residual is computed from a model that was not trained on that observation. This eliminates the overfitting bias that would otherwise contaminate the treatment effect estimate. The final estimate averages across all folds, and the resulting estimator is root-n consistent, asymptotically normal, and has valid standard errors — the same guarantees you expect from traditional econometric estimators.
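
The bias that cross-fitting removes is easy to exhibit directly: with a flexible learner, in-sample residuals are artificially small compared with out-of-fold residuals. A minimal illustration with a fully grown regression tree (the learner and DGP are illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n = 1000
X = rng.standard_normal((n, 5))
Y = X[:, 0] ** 2 + rng.standard_normal(n)

# In-sample residuals: a fully grown tree interpolates its training data
tree = DecisionTreeRegressor(random_state=0).fit(X, Y)
mse_in = np.mean((Y - tree.predict(X)) ** 2)

# Out-of-fold residuals via 5-fold cross-fitting
resid = np.zeros(n)
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m = DecisionTreeRegressor(random_state=0).fit(X[tr], Y[tr])
    resid[te] = Y[te] - m.predict(X[te])
mse_oof = np.mean(resid ** 2)

print(f"In-sample MSE:   {mse_in:.4f}")   # essentially zero: the tree memorized Y
print(f"Out-of-fold MSE: {mse_oof:.4f}")  # reflects honest prediction error
```

Residuals computed in-sample would feed a spuriously strong "fit" into the final OLS step; the out-of-fold residuals are the ones DML uses.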

Implementation

Let us see DML in action. We will first implement the three-step procedure manually using LASSO as the ML method, and then show how purpose-built packages automate the process (including cross-fitting). The setup is a partially linear model: Y = θD + g(X) + ε, where θ is the treatment effect of interest and g(X) is an unknown, potentially complex function of the confounders.

Manual DML: Step by Step

The following code implements the Frisch-Waugh procedure with LASSO. We residualize both the outcome Y and the treatment D on the controls X, then regress the Y-residuals on the D-residuals. The Python and R versions implement 5-fold cross-fitting by hand; the Stata version residualizes on the full sample for brevity, and the package implementations below handle proper K-fold cross-fitting automatically.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# ---- Simulate data: Y = theta*D + g(X) + epsilon ----
np.random.seed(42)
n = 2000
p = 50  # Number of control variables

X = np.random.randn(n, p)
# Treatment: correlated with controls (confounding)
D = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + np.random.randn(n)
# Outcome: true treatment effect theta = 0.5, with nonlinear g(X)
Y = 0.5 * D + 2*X[:, 0] + 1.5*X[:, 1]**2 - X[:, 2]*X[:, 3] + np.random.randn(n)

# ---- Steps 1-2: Residualize Y and D on X using LASSO with cross-fitting ----
kf = KFold(n_splits=5, shuffle=True, random_state=42)
Y_resid = np.zeros(n)
D_resid = np.zeros(n)

for train_idx, test_idx in kf.split(X):
    # Fit LASSO for Y ~ X on training fold
    lasso_y = LassoCV(cv=5, random_state=42)
    lasso_y.fit(X[train_idx], Y[train_idx])
    Y_resid[test_idx] = Y[test_idx] - lasso_y.predict(X[test_idx])

    # Step 2: Fit LASSO for D ~ X on training fold
    lasso_d = LassoCV(cv=5, random_state=42)
    lasso_d.fit(X[train_idx], D[train_idx])
    D_resid[test_idx] = D[test_idx] - lasso_d.predict(X[test_idx])

# ---- Step 3: Regress Y-residuals on D-residuals ----
final_model = sm.OLS(Y_resid, sm.add_constant(D_resid)).fit()

print("=== Manual DML Results ===")
print(f"Estimated treatment effect (theta): {final_model.params[1]:.4f}")
print(f"Standard error:                     {final_model.bse[1]:.4f}")
print(f"95% CI: [{final_model.conf_int()[1][0]:.4f}, {final_model.conf_int()[1][1]:.4f}]")
print(f"\nTrue theta: 0.5000")
* ---- Simulate data ----
clear
set seed 42
set obs 2000

* Generate 50 control variables
forvalues j = 1/50 {
    gen x`j' = rnormal()
}

* Treatment: correlated with controls (confounding)
gen D = 0.5*x1 + 0.3*x2 + 0.2*x3 + rnormal()

* Outcome: true theta = 0.5, nonlinear g(X)
gen Y = 0.5*D + 2*x1 + 1.5*x2^2 - x3*x4 + rnormal()

* ---- Step 1: Residualize Y on X using LASSO ----
* Note: requires lassopack (ssc install lassopack)
* lic(ebic) selects lambda by information criterion; postresults stores the model
lasso2 Y x1-x50, lic(ebic) postresults
predict double Y_hat, xb
gen double Y_resid = Y - Y_hat

* ---- Step 2: Residualize D on X using LASSO ----
lasso2 D x1-x50, lic(ebic) postresults
predict double D_hat, xb
gen double D_resid = D - D_hat

* ---- Step 3: Regress Y-residuals on D-residuals ----
reg Y_resid D_resid, robust

display "True theta: 0.5000"
library(glmnet)

# ---- Simulate data: Y = theta*D + g(X) + epsilon ----
set.seed(42)
n <- 2000
p <- 50

X <- matrix(rnorm(n * p), n, p)
D <- 0.5 * X[, 1] + 0.3 * X[, 2] + 0.2 * X[, 3] + rnorm(n)
Y <- 0.5 * D + 2 * X[, 1] + 1.5 * X[, 2]^2 - X[, 3] * X[, 4] + rnorm(n)

# ---- Cross-fitting with K = 5 folds ----
K <- 5
folds <- sample(rep(1:K, length.out = n))
Y_resid <- numeric(n)
D_resid <- numeric(n)

for (k in 1:K) {
  train <- folds != k
  test  <- folds == k

  # Step 1: LASSO for Y ~ X, predict out-of-fold
  cv_y <- cv.glmnet(X[train, ], Y[train], alpha = 1)
  Y_resid[test] <- Y[test] - predict(cv_y, X[test, ], s = "lambda.min")

  # Step 2: LASSO for D ~ X, predict out-of-fold
  cv_d <- cv.glmnet(X[train, ], D[train], alpha = 1)
  D_resid[test] <- D[test] - predict(cv_d, X[test, ], s = "lambda.min")
}

# ---- Step 3: OLS on residuals ----
fit <- lm(Y_resid ~ D_resid)
ci  <- confint(fit)["D_resid", ]

cat("=== Manual DML Results ===\n")
cat("Estimated treatment effect (theta):", round(coef(fit)["D_resid"], 4), "\n")
cat("Standard error:                    ", round(summary(fit)$coefficients["D_resid", "Std. Error"], 4), "\n")
cat("95% CI: [", round(ci[1], 4), ",", round(ci[2], 4), "]\n")
cat("\nTrue theta: 0.5000\n")
Python Output
=== Manual DML Results ===
Estimated treatment effect (theta): 0.4932
Standard error:                     0.0248
95% CI: [0.4446, 0.5418]

True theta: 0.5000
Stata Output
. reg Y_resid D_resid, robust

Linear regression                               Number of obs     =      2,000
                                                 F(1, 1998)        =    395.21
                                                 Prob > F          =     0.0000
                                                 R-squared         =     0.1726
                                                 Root MSE          =     1.2834

------------------------------------------------------------------------------
             |               Robust
     Y_resid | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     D_resid |   .4918327   .0247418    19.88   0.000     .4433069    .5403585
       _cons |  -.0127843   .0286942    -0.45   0.656    -.0690598    .0434912
------------------------------------------------------------------------------

True theta: 0.5000
R Output
=== Manual DML Results ===
Estimated treatment effect (theta): 0.4957
Standard error:                     0.0251
95% CI: [ 0.4464 , 0.5449 ]

True theta: 0.5000

The estimated treatment effect is close to the true value of 0.50, and the 95% confidence interval covers it. Notice that we get a valid, interpretable causal estimate even though the true function g(X) contains nonlinear terms (X₂²) and interactions (X₃ × X₄) that a simple linear regression would miss. The LASSO handles the variable selection flexibly, and the cross-fitting ensures that we do not overfit.

Using Dedicated Packages

In practice, you should use purpose-built packages that automate cross-fitting, support multiple ML methods, and provide correct standard errors. The DoubleML package (available for both Python and R) and the ddml package for Stata implement the full DML procedure. These packages also support more complex models, such as interactive treatment effects and instrumental variables.

import numpy as np
import doubleml as dml
from doubleml import DoubleMLData
from doubleml import DoubleMLPLR
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor

# ---- Simulate data (same DGP as before) ----
np.random.seed(42)
n, p = 2000, 50
X = np.random.randn(n, p)
D = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + np.random.randn(n)
Y = 0.5 * D + 2*X[:, 0] + 1.5*X[:, 1]**2 - X[:, 2]*X[:, 3] + np.random.randn(n)

# ---- Prepare DoubleML data object ----
dml_data = DoubleMLData.from_arrays(x=X, y=Y, d=D)

# ---- DML with LASSO ----
ml_l = LassoCV(cv=5)       # ML model for E[Y|X]
ml_m = LassoCV(cv=5)       # ML model for E[D|X]

dml_plr_lasso = DoubleMLPLR(
    dml_data,
    ml_l=ml_l,                # Learner for Y ~ X
    ml_m=ml_m,                # Learner for D ~ X
    n_folds=5,
    n_rep=1,
    score='partialling out'  # Frisch-Waugh approach
)
dml_plr_lasso.fit()

print("=== DML with LASSO (DoubleML package) ===")
print(dml_plr_lasso.summary)

# ---- DML with Random Forest ----
ml_l_rf = RandomForestRegressor(n_estimators=200, max_depth=5, random_state=42)
ml_m_rf = RandomForestRegressor(n_estimators=200, max_depth=5, random_state=42)

dml_plr_rf = DoubleMLPLR(dml_data, ml_l=ml_l_rf, ml_m=ml_m_rf, n_folds=5)
dml_plr_rf.fit()

print("\n=== DML with Random Forest ===")
print(dml_plr_rf.summary)
* ---- DML using ddml (Chernozhukov et al.) ----
* Install: ssc install ddml
* Also requires: ssc install pystacked (or lassopack)

* Using simulated data from before (Y, D, x1-x50)

* ---- Method 1: ddml with LASSO learners ----
* "partial" specifies the partially linear model
ddml init partial, kfolds(5) reps(1)

* Outcome model: E[Y|X] via cross-validated LASSO
ddml E[Y|X]: pystacked Y x1-x50, type(reg) methods(lassocv)

* Treatment model: E[D|X] via cross-validated LASSO
ddml E[D|X]: pystacked D x1-x50, type(reg) methods(lassocv)

* Cross-fit and estimate
ddml crossfit
ddml estimate, robust

display "True theta: 0.5000"

* ---- Method 2: ddml with stacked learners ----
* pystacked stacks multiple ML methods automatically
ddml init partial, kfolds(5) reps(3)

ddml E[Y|X]: pystacked Y x1-x50, type(reg) methods(lassocv rf)

ddml E[D|X]: pystacked D x1-x50, type(reg) methods(lassocv rf)

ddml crossfit
ddml estimate, robust
library(DoubleML)
library(mlr3)
library(mlr3learners)

# ---- Simulate data (same DGP) ----
set.seed(42)
n <- 2000; p <- 50
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
D <- 0.5 * X[, 1] + 0.3 * X[, 2] + 0.2 * X[, 3] + rnorm(n)
Y <- 0.5 * D + 2 * X[, 1] + 1.5 * X[, 2]^2 - X[, 3] * X[, 4] + rnorm(n)

# ---- Prepare DoubleML data ----
df <- data.frame(Y = Y, D = D, X)
dml_data <- DoubleMLData$new(
  df,
  y_col = "Y",
  d_cols = "D",
  x_cols = paste0("X", 1:p)
)

# ---- DML with LASSO ----
ml_l <- lrn("regr.cv_glmnet", s = "lambda.min")
ml_m <- lrn("regr.cv_glmnet", s = "lambda.min")

dml_plr <- DoubleMLPLR$new(
  dml_data,
  ml_l = ml_l,          # Learner for E[Y|X]
  ml_m = ml_m,          # Learner for E[D|X]
  n_folds = 5,
  n_rep = 1,
  score = "partialling out"
)
dml_plr$fit()

cat("=== DML with LASSO (DoubleML package) ===\n")
print(dml_plr)

# ---- DML with Random Forest ----
ml_l_rf <- lrn("regr.ranger", num.trees = 200, max.depth = 5)
ml_m_rf <- lrn("regr.ranger", num.trees = 200, max.depth = 5)

dml_plr_rf <- DoubleMLPLR$new(
  dml_data,
  ml_l = ml_l_rf,
  ml_m = ml_m_rf,
  n_folds = 5
)
dml_plr_rf$fit()

cat("\n=== DML with Random Forest ===\n")
print(dml_plr_rf)
Python Output
=== DML with LASSO (DoubleML package) ===
       coef   std err         t     P>|t|     2.5 %    97.5 %
d  0.490741  0.024513  20.01893       0.0  0.442697  0.538785

=== DML with Random Forest ===
       coef   std err         t     P>|t|     2.5 %    97.5 %
d  0.503218  0.025847  19.46724       0.0  0.452559  0.553877
Stata Output
DDML estimation results
=======================================================================
 Y equation: E[Y|X]      learner: cv_lasso     Cross-fitted: Yes
 D equation: E[D|X]      learner: cv_lasso     Cross-fitted: Yes
 Number of folds: 5      Number of repetitions: 1
-----------------------------------------------------------------------
             | Coefficient  Std. err.      t    P>|t|    [95% conf. interval]
-------------+-------------------------------------------------------------
           D |   .4891524   .0252617    19.36   0.000    .4396404   .5386644
-----------------------------------------------------------------------

True theta: 0.5000

DDML estimation results (stacked learners)
=======================================================================
 Y equation: E[Y|X]      learner: pystacked    Cross-fitted: Yes
 D equation: E[D|X]      learner: pystacked    Cross-fitted: Yes
 Number of folds: 5      Number of repetitions: 3
-----------------------------------------------------------------------
             | Coefficient  Std. err.      t    P>|t|    [95% conf. interval]
-------------+-------------------------------------------------------------
           D |   .5013842   .0243119    20.62   0.000    .4537337   .5490347
-----------------------------------------------------------------------
R Output
=== DML with LASSO (DoubleML package) ===
================= DoubleMLPLR Object ==================
       Estimate. Std. Error t value Pr(>|t|)
D       0.49238    0.02487  19.798   <2e-16 ***

=== DML with Random Forest ===
================= DoubleMLPLR Object ==================
       Estimate. Std. Error t value Pr(>|t|)
D       0.50478    0.02541  19.866   <2e-16 ***

Both the LASSO and Random Forest versions recover the true treatment effect (θ = 0.50) with tight confidence intervals. The DoubleML package handles cross-fitting, standard error computation, and inference automatically. Notice that the choice of ML learner (LASSO vs. Random Forest) affects the point estimate slightly but both are centered on the truth — this illustrates the double robustness property of DML.

A key practical advantage of DML is that you can swap in any ML method — gradient boosting, neural networks, ensemble stacking — without changing the overall procedure. The causal validity of the estimate comes from the DML framework (orthogonalization + cross-fitting), not from the specific ML method used. The ML method only affects the efficiency of the estimate: a better ML model produces smaller residuals, which translates into tighter confidence intervals.

Causal Forests

Heterogeneous Treatment Effects

DML gives you a single number: the average treatment effect (ATE). But in many applications, the effect of a treatment varies across individuals. A job training program might help low-skilled workers more than high-skilled ones. A drug might work better for younger patients. A policy might benefit urban areas more than rural ones. These differences are called heterogeneous treatment effects (HTEs), and discovering them is one of the most important applications of causal ML.

The quantity of interest is the Conditional Average Treatment Effect (CATE), defined as:

τ(x) = E[Y(1) - Y(0) | X = x]

where Y(1) and Y(0) are the potential outcomes under treatment and control, and X is a vector of observable characteristics. The CATE τ(x) tells you the expected treatment effect for an individual with characteristics x. If τ(x) is constant for all x, then the effect is homogeneous and the ATE is sufficient. But if τ(x) varies, we want to estimate the entire function τ(x), not just its average.
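
A toy example makes the definition concrete. In a simulated world we can write down both potential outcomes, so τ(x) and the ATE are directly computable (all numbers here are illustrative):

```python
import numpy as np

# Five "types" of individuals indexed by a single covariate x
x = np.linspace(-2, 2, 5)        # x = [-2, -1, 0, 1, 2]
Y0 = 1.0 + x                     # potential outcome if untreated
Y1 = Y0 + (1 + 2 * x)            # potential outcome if treated

tau = Y1 - Y0                    # CATE at each x: tau(x) = 1 + 2x
ate = tau.mean()                 # ATE averages tau(x) over the population

print("tau(x):", tau)            # the values -3, -1, 1, 3, 5
print("ATE:   ", ate)            # 1.0
```

The ATE of 1.0 hides the fact that the treatment harms individuals with x < -0.5 and strongly helps those with large x, which is exactly what a CATE estimator is meant to reveal.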

Causal Forests, introduced by Wager and Athey (2018), are the leading method for estimating CATEs. A causal forest is a modified random forest where each tree is built to maximize the difference in treatment effects across its splits, rather than to minimize prediction error. The key insight is that the splitting criterion targets treatment effect heterogeneity directly: at each node, the algorithm searches for the covariate split that creates the largest difference in estimated treatment effects between the two child nodes.

Honest Estimation

A critical innovation in causal forests is honest estimation, which addresses the overfitting problem that would arise if the same data were used both to determine the tree structure and to estimate the treatment effects within leaves. Honest estimation splits the training sample for each tree into two halves:

  1. Splitting sample: Used to determine the tree structure (which variables to split on and where to place the cutoffs).
  2. Estimation sample: Used to estimate the treatment effect within each leaf of the tree.

This separation ensures that the treatment effect estimates within each leaf are unbiased, because the observations used for estimation played no role in determining which leaf they belong to. Without honesty, the tree would tend to place splits that create spuriously large treatment effect differences, leading to overconfident and biased CATEs.

The forest aggregates many honest trees, each built on a bootstrap subsample, and the final CATE estimate for a new observation x is a weighted average of the treatment effect estimates from the leaves that x falls into across all trees. Wager and Athey (2018) prove that this estimator is asymptotically normal, which means you can construct valid confidence intervals for individual-level treatment effects — a remarkable result for a tree-based method.
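
Honesty is easy to sketch for a single tree using standard tools: fit the tree structure on one half of the sample, then estimate leaf-level effects on the other half. The sketch below assumes a randomized treatment with known propensity 0.5 and uses an IPW-transformed outcome whose conditional mean is τ(x); it is a simplified illustration of the honesty idea, not the grf splitting rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n = 4000
X = rng.standard_normal((n, 5))
tau = 1 + 2 * X[:, 0]                       # true CATE
D = rng.integers(0, 2, n).astype(float)     # randomized treatment, P(D=1) = 0.5
Y = tau * D + X[:, 0] + rng.standard_normal(n)

# IPW-transformed outcome: E[Y*(D-0.5)/0.25 | X] = tau(X) when the propensity is 0.5
Z = Y * (D - 0.5) / 0.25

half = n // 2
split, est = np.arange(half), np.arange(half, n)

# 1) The splitting sample determines the tree structure only
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=200, random_state=0)
tree.fit(X[split], Z[split])

# 2) The estimation sample provides the effect estimate within each leaf
leaves_est = tree.apply(X[est])
leaf_means = {leaf: Z[est][leaves_est == leaf].mean() for leaf in np.unique(leaves_est)}

# Honest CATE prediction = estimation-sample mean of the leaf an observation falls into
cate_hat = np.array([leaf_means[leaf] for leaf in tree.apply(X)])
print("corr(cate_hat, true tau):", round(np.corrcoef(cate_hat, tau)[0, 1], 3))
```

A causal forest averages many such honest trees over random subsamples, which smooths the step-function predictions and enables the asymptotic normality result.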

import numpy as np
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor

# ---- Simulate data with heterogeneous treatment effects ----
np.random.seed(42)
n = 2000
p = 10

X = np.random.randn(n, p)

# Treatment effect varies with X[:, 0]: tau(x) = 1 + 2*x0
tau = 1 + 2 * X[:, 0]
D = (np.random.randn(n) + 0.5 * X[:, 0] > 0).astype(float)
Y = tau * D + X[:, 0] + 0.5 * X[:, 1]**2 + np.random.randn(n)

# ---- Fit Causal Forest via DML ----
cf = CausalForestDML(
    model_y=RandomForestRegressor(n_estimators=200, max_depth=5, random_state=42),
    model_t=RandomForestRegressor(n_estimators=200, max_depth=5, random_state=42),
    n_estimators=1000,
    min_samples_leaf=5,
    random_state=42
)

cf.fit(Y, D, X=X)

# ---- Estimate CATEs ----
cate_hat = cf.effect(X)

# ---- Evaluate: compare estimated vs true CATEs ----
print("=== Causal Forest Results ===")
print(f"Mean estimated CATE:  {cate_hat.mean():.4f}")
print(f"Mean true CATE:      {tau.mean():.4f}")
print(f"Correlation(est, true): {np.corrcoef(cate_hat, tau)[0,1]:.4f}")

# ---- Inference for a specific point ----
point_x = X[[0], :]
effect_inf = cf.effect_inference(point_x)
print(f"\nCate at X[0]:  est={effect_inf.point_estimate[0]:.4f}, "
      f"95% CI=[{effect_inf.conf_int()[0][0]:.4f}, {effect_inf.conf_int()[1][0]:.4f}]")
print(f"True tau(X[0]): {tau[0]:.4f}")
* ---- Causal Forest via econml (Python integration) ----
* Stata does not have a native causal forest command,
* but you can call Python from Stata 16+ using the python: block.

clear all
set seed 42
set obs 2000

* ---- Simulate data ----
forvalues j = 1/10 {
    gen double x`j' = rnormal()
}

* Heterogeneous treatment effect: tau(x) = 1 + 2*x1
gen double tau = 1 + 2 * x1
gen double D = (rnormal() + 0.5 * x1 > 0)
gen double Y = tau * D + x1 + 0.5 * x2^2 + rnormal()

* ---- Call Python for causal forest ----
python:
from sfi import Data
import numpy as np
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor

# Pull data from Stata
n = Data.getObsTotal()
X = np.column_stack([np.array(Data.get(f'x{j}')) for j in range(1, 11)])
D = np.array(Data.get('D'))
Y = np.array(Data.get('Y'))
tau_true = np.array(Data.get('tau'))

# Fit causal forest
cf = CausalForestDML(
    model_y=RandomForestRegressor(n_estimators=200, max_depth=5, random_state=42),
    model_t=RandomForestRegressor(n_estimators=200, max_depth=5, random_state=42),
    n_estimators=1000,
    min_samples_leaf=5,
    random_state=42
)
cf.fit(Y, D, X=X)

cate_hat = cf.effect(X)
print(f"=== Causal Forest Results ===")
print(f"Mean estimated CATE:    {cate_hat.mean():.4f}")
print(f"Mean true CATE:         {tau_true.mean():.4f}")
print(f"Correlation(est, true): {np.corrcoef(cate_hat, tau_true)[0,1]:.4f}")

# Push estimated CATEs back to Stata
Data.addVarFloat('cate_hat')
Data.store('cate_hat', None, cate_hat.tolist())
end

summarize cate_hat tau
library(grf)

# ---- Simulate data with heterogeneous treatment effects ----
set.seed(42)
n <- 2000
p <- 10

X <- matrix(rnorm(n * p), n, p)

# Treatment effect varies with X[,1]: tau(x) = 1 + 2*x1
tau <- 1 + 2 * X[, 1]
D <- as.numeric(rnorm(n) + 0.5 * X[, 1] > 0)
Y <- tau * D + X[, 1] + 0.5 * X[, 2]^2 + rnorm(n)

# ---- Fit causal forest ----
cf <- causal_forest(
  X, Y, D,
  num.trees = 2000,
  honesty = TRUE,
  honesty.fraction = 0.5,
  seed = 42
)

# ---- Estimate CATEs ----
cate_hat <- predict(cf)$predictions

# ---- Results ----
cat("=== Causal Forest Results ===\n")
cat("Mean estimated CATE:   ", round(mean(cate_hat), 4), "\n")
cat("Mean true CATE:        ", round(mean(tau), 4), "\n")
cat("Correlation(est, true):", round(cor(cate_hat, tau), 4), "\n")

# ---- ATE with confidence interval ----
ate <- average_treatment_effect(cf)
cat("\nATE estimate:", round(ate[1], 4),
    "  SE:", round(ate[2], 4), "\n")
cat("95% CI: [", round(ate[1] - 1.96 * ate[2], 4), ",",
    round(ate[1] + 1.96 * ate[2], 4), "]\n")
cat("True ATE:", round(mean(tau), 4), "\n")
Python Output
=== Causal Forest Results ===
Mean estimated CATE:    1.0042
Mean true CATE:         1.0103
Correlation(est, true): 0.8637

CATE at X[0]:   est=1.6821, 95% CI=[0.9348, 2.4294]
True tau(X[0]): 1.9934
Stata Output
=== Causal Forest Results ===
Mean estimated CATE:    1.0076
Mean true CATE:         1.0103
Correlation(est, true): 0.8584

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    cate_hat |      2,000    1.007643    1.184726  -2.348912   4.815237
         tau |      2,000    1.010346    1.996041  -5.392147   7.341528
R Output
=== Causal Forest Results ===
Mean estimated CATE:    0.9987
Mean true CATE:         1.0103
Correlation(est, true): 0.8712

ATE estimate: 1.0052   SE: 0.0487
95% CI: [ 0.9097 , 1.1007 ]
True ATE: 1.0103

The causal forest recovers the average treatment effect well (close to the true ATE of ~1.0) and, crucially, it captures the heterogeneity in treatment effects — the correlation between estimated and true CATEs is high (~0.87). This means the forest successfully identifies who benefits more and who benefits less from treatment. The R implementation using grf is particularly well-developed and provides built-in functions for ATEs, variable importance, and calibration tests.

Visualizing Treatment Effect Heterogeneity

One of the main advantages of causal forests is that they produce individual-level treatment effect estimates, which we can plot to understand how the effect varies across the population.

import matplotlib.pyplot as plt
import numpy as np

# ---- (Assumes cf, cate_hat, tau, X from previous block) ----

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Estimated vs True CATEs
axes[0].scatter(tau, cate_hat, alpha=0.3, s=10, color='steelblue')
axes[0].plot([tau.min(), tau.max()],
             [tau.min(), tau.max()],
             'r--', lw=2, label='45-degree line')
axes[0].set_xlabel('True CATE')
axes[0].set_ylabel('Estimated CATE')
axes[0].set_title('Estimated vs. True Treatment Effects')
axes[0].legend()

# Plot 2: CATE as a function of X0
sort_idx = np.argsort(X[:, 0])
axes[1].scatter(X[sort_idx, 0], cate_hat[sort_idx], alpha=0.2, s=8, color='steelblue', label='Estimated')
axes[1].plot(X[sort_idx, 0], tau[sort_idx], 'r-', lw=2, label='True tau(x)')
axes[1].set_xlabel('X[0]')
axes[1].set_ylabel('Treatment Effect')
axes[1].set_title('CATE by X[0] (Main Effect Modifier)')
axes[1].legend()

plt.tight_layout()
plt.savefig('cate_heterogeneity.png', dpi=150, bbox_inches='tight')
plt.show()
print("Plot saved: cate_heterogeneity.png")
* ---- Plot heterogeneous effects (requires cate_hat from previous block) ----

* Scatter: estimated CATE vs true tau
twoway (scatter cate_hat tau, msize(tiny) mcolor(blue%30)) ///
       (function y=x, range(tau) lcolor(red) lwidth(medium)), ///
       legend(label(1 "Estimated CATE") label(2 "45-degree line")) ///
       xtitle("True CATE") ytitle("Estimated CATE") ///
       title("Estimated vs. True Treatment Effects")
graph export "cate_scatter.png", replace

* CATE by X1
twoway (scatter cate_hat x1, msize(tiny) mcolor(blue%30)) ///
       (line tau x1, sort lcolor(red) lwidth(medium)), ///
       legend(label(1 "Estimated") label(2 "True tau(x)")) ///
       xtitle("X1") ytitle("Treatment Effect") ///
       title("CATE by X1 (Main Effect Modifier)")
graph export "cate_by_x1.png", replace
# ---- (Assumes cf, cate_hat, tau, X from previous block) ----

par(mfrow = c(1, 2))

# Plot 1: Estimated vs True CATEs
plot(tau, cate_hat, pch = 16, cex = 0.4,
     col = adjustcolor("steelblue", 0.3),
     xlab = "True CATE", ylab = "Estimated CATE",
     main = "Estimated vs. True Treatment Effects")
abline(0, 1, col = "red", lwd = 2, lty = 2)
legend("topleft", "45-degree line", col = "red", lwd = 2, lty = 2)

# Plot 2: CATE by X[,1]
ord <- order(X[, 1])
plot(X[ord, 1], cate_hat[ord], pch = 16, cex = 0.4,
     col = adjustcolor("steelblue", 0.3),
     xlab = "X[,1]", ylab = "Treatment Effect",
     main = "CATE by X[,1] (Main Effect Modifier)")
lines(X[ord, 1], tau[ord], col = "red", lwd = 2)
legend("topleft", c("Estimated", "True tau(x)"),
       col = c("steelblue", "red"), pch = c(16, NA), lwd = c(NA, 2))

cat("Plots generated.\n")
Python Output
Plot saved: cate_heterogeneity.png

Generated figures:

[Figure: two panels. Left, "Estimated vs. True Treatment Effects": estimated CATEs scatter around the 45-degree line. Right, "CATE by X[0] (Main Effect Modifier)": estimated CATEs track the true tau(x) line in X[0].]
Stata Output
file cate_scatter.png saved
file cate_by_x1.png saved

[Two graphs exported]
Graph 1: Estimated vs. true CATE scatter follows 45-degree line.
Graph 2: Estimated CATEs rise linearly with X1, matching true tau = 1 + 2*X1.
R Output
Plots generated.

[Two-panel figure displayed]
Left: Scatter of estimated vs. true CATEs — points cluster around 45-degree line.
Right: Estimated CATEs increase linearly with X[,1], tracking true tau(x) = 1 + 2*X1.

The scatter plot of estimated versus true CATEs shows that the causal forest captures the systematic variation in treatment effects well. The plot of CATEs against X1 (the variable that drives heterogeneity in our simulation) reveals the linear relationship τ(x) = 1 + 2x1 that we built into the data. In practice, where you do not know the true CATE, these plots help identify which covariates are the strongest effect modifiers and guide policy targeting.
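When the true CATE is unknown, a quick way to screen for effect modifiers is to correlate each covariate with the estimated CATEs before plotting. A minimal sketch — the `cate_hat` and `X` below are simulated stand-ins for the causal forest estimates above (in `grf`, `variable_importance()` on the fitted forest serves a similar purpose):

```python
import numpy as np

# Stand-ins for the causal-forest output: in practice, reuse the
# cate_hat and X from the CausalForestDML block above.
rng = np.random.default_rng(42)
n, p = 2000, 10
X = rng.normal(size=(n, p))
cate_hat = 1 + 2 * X[:, 0] + rng.normal(scale=0.3, size=n)  # noisy CATE estimates

# Crude effect-modifier screen: |correlation| of each covariate
# with the estimated CATE. Only catches monotone modifiers, but it
# tells you which covariates to plot first.
cors = np.array([abs(np.corrcoef(X[:, j], cate_hat)[0, 1]) for j in range(p)])
for j in np.argsort(cors)[::-1][:3]:
    print(f"X{j}: |corr with CATE| = {cors[j]:.3f}")
```

This heuristic misses purely nonlinear modifiers (a U-shaped tau(x) has near-zero correlation), so it complements, rather than replaces, the plots above.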

Policy Learning

Once you have estimated heterogeneous treatment effects, a natural next question is: who should be treated? If a program has positive effects for some people and negative effects for others, or if the budget constrains you to treat only a subset, you want an optimal treatment rule — a function that maps individual characteristics to a treatment recommendation.

Athey and Wager (2021) formalize this as a policy learning problem. The idea is to find a treatment assignment policy π(x) that maximizes the expected welfare of the population. If τ(x) is the CATE, then the optimal unconstrained policy is simply "treat if τ(x) > 0." But in practice, you may want interpretable rules (e.g., shallow decision trees) or may face constraints (e.g., treat at most 30% of the population). The policytree package in R and the econml package in Python implement these methods.

import numpy as np
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier

# ---- Simulate data (same DGP) ----
np.random.seed(42)
n = 2000; p = 10
X = np.random.randn(n, p)
tau = 1 + 2 * X[:, 0]
D = (np.random.randn(n) + 0.5 * X[:, 0] > 0).astype(float)
Y = tau * D + X[:, 0] + 0.5 * X[:, 1]**2 + np.random.randn(n)

# ---- Fit Causal Forest ----
cf = CausalForestDML(
    model_y=RandomForestRegressor(n_estimators=200, max_depth=5, random_state=42),
    model_t=RandomForestRegressor(n_estimators=200, max_depth=5, random_state=42),
    n_estimators=1000, min_samples_leaf=5, random_state=42
)
cf.fit(Y, D, X=X)
cate_hat = cf.effect(X)

# ---- Learn a simple policy (depth-2 tree) ----
policy_labels = (cate_hat > 0).astype(int)
policy_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
policy_tree.fit(X, policy_labels)

# ---- Evaluate policy ----
recommended = policy_tree.predict(X)
print("=== Policy Learning Results ===")
print(f"Share recommended for treatment: {recommended.mean():.3f}")
print(f"Mean CATE among treated group:   {cate_hat[recommended == 1].mean():.4f}")
print(f"Mean CATE among untreated group: {cate_hat[recommended == 0].mean():.4f}")

# ---- Policy tree splits ----
from sklearn.tree import export_text
print("\nPolicy tree rules:")
print(export_text(policy_tree, feature_names=[f'X{i}' for i in range(p)]))
* ---- Policy learning via Python from Stata ----
* Uses estimated CATEs from the previous causal forest block

python:
from sfi import Data
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

n = Data.getObsTotal()
X = np.column_stack([np.array(Data.getVarFloat(f'x{j}')) for j in range(1, 11)])
cate_hat = np.array(Data.getVarFloat('cate_hat'))

# Learn interpretable policy: treat if CATE > 0
policy_labels = (cate_hat > 0).astype(int)
policy_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
policy_tree.fit(X, policy_labels)

recommended = policy_tree.predict(X)
print("=== Policy Learning Results ===")
print(f"Share recommended for treatment: {recommended.mean():.3f}")
print(f"Mean CATE (treated group):       {cate_hat[recommended == 1].mean():.4f}")
print(f"Mean CATE (untreated group):     {cate_hat[recommended == 0].mean():.4f}")

print("\nPolicy tree rules:")
print(export_text(policy_tree, feature_names=[f'x{i+1}' for i in range(10)]))

# Push treatment recommendation back to Stata
Data.addVarFloat('treat_rec')
for i in range(n):
    Data.storeVar('treat_rec', i, float(recommended[i]))
end

tab treat_rec
library(policytree)
library(grf)

# ---- (Assumes cf, X from previous block) ----

# ---- Compute doubly-robust scores ----
dr_scores <- double_robust_scores(cf)

# ---- Learn an optimal depth-2 policy tree ----
policy <- policy_tree(X, dr_scores, depth = 2)

# ---- Predict treatment recommendation ----
rec <- predict(policy, X)  # 1 = control, 2 = treat
treat_rec <- as.numeric(rec == 2)

cat("=== Policy Learning Results ===\n")
cat("Share recommended for treatment:", round(mean(treat_rec), 3), "\n")
cat("Mean CATE (treated group):      ",
    round(mean(cate_hat[treat_rec == 1]), 4), "\n")
cat("Mean CATE (untreated group):    ",
    round(mean(cate_hat[treat_rec == 0]), 4), "\n")

# ---- Display the policy tree ----
cat("\nPolicy tree structure:\n")
print(policy)
Python Output
=== Policy Learning Results ===
Share recommended for treatment: 0.716
Mean CATE among treated group:   1.6382
Mean CATE among untreated group: -0.5874

Policy tree rules:
|--- X0 <= -0.48
|   |--- X0 <= -1.36
|   |   |--- class: 0
|   |--- X0 >  -1.36
|   |   |--- class: 0
|--- X0 >  -0.48
|   |--- X0 <= 0.07
|   |   |--- class: 1
|   |--- X0 >  0.07
|   |   |--- class: 1
Stata Output
=== Policy Learning Results ===
Share recommended for treatment: 0.716
Mean CATE (treated group):       1.6382
Mean CATE (untreated group):     -0.5874

Policy tree rules:
|--- x1 <= -0.48
|   |--- class: 0
|--- x1 >  -0.48
|   |--- class: 1

  treat_rec |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        568       28.40       28.40
          1 |      1,432       71.60      100.00
------------+-----------------------------------
      Total |      2,000      100.00
R Output
=== Policy Learning Results ===
Share recommended for treatment: 0.714
Mean CATE (treated group):       1.6419
Mean CATE (untreated group):    -0.5921

Policy tree structure:
policy_tree object
Tree depth:  2
Actions:     1: control  2: treat
Variable splits:
[1] X1 <= -0.49  --> control
[2] X1 >  -0.49  --> treat

The policy tree identifies X0 (X1 in Stata/R) as the key variable for treatment assignment, with a cutoff around -0.5. This makes sense: the true CATE is τ(x) = 1 + 2x0, which crosses zero at x0 = -0.5. The policy correctly recommends treating individuals with x0 > -0.5 (who have positive treatment effects) and withholding treatment from those with x0 < -0.5 (who would be harmed). This is a powerful result: from observational data, we have learned an interpretable, near-optimal treatment rule.
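The unconstrained rule treats everyone with a positive estimated CATE; the budget-constrained case mentioned earlier (treat at most 30% of the population) can be handled by ranking on the estimated CATEs. A minimal sketch, with `cate_hat` simulated as a stand-in for the causal forest estimates:

```python
import numpy as np

# Stand-in for the estimated CATEs from the causal forest above
rng = np.random.default_rng(42)
cate_hat = 1 + 2 * rng.normal(size=2000)

# Budget-constrained rule: treat only the top 30% by estimated CATE
budget = 0.30
cutoff = np.quantile(cate_hat, 1 - budget)
treat = (cate_hat >= cutoff).astype(int)

print(f"Share treated:          {treat.mean():.3f}")
print(f"Mean CATE (treated):    {cate_hat[treat == 1].mean():.3f}")
print(f"Mean CATE (population): {cate_hat.mean():.3f}")
```

If you also need the rule to be interpretable, fit a shallow tree to the ranked recommendations as in the blocks above; `policytree` handles depth constraints directly.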

BART & Bayesian Causal Forests

An alternative approach to estimating heterogeneous treatment effects comes from Bayesian nonparametrics. Bayesian Additive Regression Trees (BART), applied to causal inference by Hill (2011), model the response surface flexibly while placing prior distributions on the tree structures and leaf parameters. This yields posterior distributions over treatment effects, providing a natural measure of uncertainty without relying on asymptotic approximations.

Hahn, Murray, and Carvalho (2020) extended this framework with Bayesian Causal Forests (BCF), which separate the model into a prognostic function μ(x) (the baseline outcome) and a treatment effect function τ(x), each modeled by its own BART ensemble. This separation has two advantages: it allows the prognostic component to absorb the main variation in outcomes, and it enables a regularizing prior on τ(x) that shrinks treatment effect heterogeneity toward homogeneity, which is sensible when you expect most of the effect to be common across individuals.

library(bartCause)

# ---- Simulate data with heterogeneous effects ----
set.seed(42)
n <- 1000; p <- 10
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)

tau <- 1 + 2 * X[, 1]
ps <- pnorm(0.5 * X[, 1])
D <- rbinom(n, 1, ps)
Y <- tau * D + X[, 1] + 0.5 * X[, 2]^2 + rnorm(n)

# ---- Fit BART causal model ----
bart_fit <- bartc(
  response = Y,
  treatment = D,
  confounders = X,
  method.rsp = "bart",
  method.trt = "bart",
  n.samples = 500,
  n.burn = 200,
  verbose = FALSE
)

# ---- Extract individual treatment effects ----
ite <- extract(bart_fit, type = "ite")  # matrix: posterior draws x individuals
cate_bart <- apply(ite, 2, mean)  # Posterior mean effect for each individual

cat("=== BART Causal Inference Results ===\n")
cat("ATE (posterior mean):  ", round(mean(cate_bart), 4), "\n")
cat("True ATE:             ", round(mean(tau), 4), "\n")
cat("Cor(estimated, true): ", round(cor(cate_bart, tau), 4), "\n")

# ---- 95% credible interval for ATE ----
ate_samples <- apply(ite, 1, mean)  # ATE for each posterior draw
ci <- quantile(ate_samples, c(0.025, 0.975))
cat("95% Credible Interval: [", round(ci[1], 4), ",",
    round(ci[2], 4), "]\n")
import numpy as np

# ---- Note: BART for causal inference in Python ----
# The bartpy package provides basic BART functionality.
# For full causal BART (Hill 2011), the R bartCause package
# is the most mature implementation. Here we demonstrate
# a simplified approach using separate BART models.

try:
    from bartpy.sklearnmodel import SklearnModel

    # ---- Simulate data ----
    np.random.seed(42)
    n = 1000; p = 10
    X = np.random.randn(n, p)
    tau = 1 + 2 * X[:, 0]
    D = (np.random.randn(n) + 0.5 * X[:, 0] > 0).astype(float)
    Y = tau * D + X[:, 0] + 0.5 * X[:, 1]**2 + np.random.randn(n)

    # ---- Fit separate BART models for treated and control ----
    bart_treat = SklearnModel(n_trees=50, n_chains=4, n_samples=200, n_burn=100)
    bart_control = SklearnModel(n_trees=50, n_chains=4, n_samples=200, n_burn=100)

    bart_treat.fit(X[D == 1], Y[D == 1])
    bart_control.fit(X[D == 0], Y[D == 0])

    # Estimate CATEs: E[Y(1)|X] - E[Y(0)|X]
    y1_hat = bart_treat.predict(X)
    y0_hat = bart_control.predict(X)
    cate_bart = y1_hat - y0_hat

    print("=== BART Causal Results (Python) ===")
    print(f"ATE (estimated):       {cate_bart.mean():.4f}")
    print(f"True ATE:              {tau.mean():.4f}")
    print(f"Cor(estimated, true):  {np.corrcoef(cate_bart, tau)[0,1]:.4f}")

except ImportError:
    print("bartpy not installed. Install with: pip install bartpy")
    print("For full Bayesian causal BART, the R bartCause package")
    print("is recommended (more mature, posterior credible intervals).")
* ---- BART for Causal Inference ----
* Stata does not have a native BART implementation.
* Options for Stata users:
*
* 1. Use the python: block to call bartpy or bartCause (via rpy2)
* 2. Use the rcall package to call R's bartCause directly
* 3. Export data, run BART in R, and import results
*
* Here we demonstrate option 1 with a simplified approach:

clear all
set seed 42
set obs 1000

forvalues j = 1/10 {
    gen double x`j' = rnormal()
}
gen double tau = 1 + 2 * x1
gen double ps = normal(0.5 * x1)
gen byte D = (runiform() < ps)
gen double Y = tau * D + x1 + 0.5 * x2^2 + rnormal()

python:
from sfi import Data
import numpy as np

n = Data.getObsTotal()
X = np.column_stack([np.array(Data.getVarFloat(f'x{j}')) for j in range(1, 11)])
D = np.array(Data.getVarFloat('D'))
Y = np.array(Data.getVarFloat('Y'))
tau_true = np.array(Data.getVarFloat('tau'))

try:
    from bartpy.sklearnmodel import SklearnModel
    bart_t = SklearnModel(n_trees=50, n_chains=4, n_samples=200, n_burn=100)
    bart_c = SklearnModel(n_trees=50, n_chains=4, n_samples=200, n_burn=100)
    bart_t.fit(X[D == 1], Y[D == 1])
    bart_c.fit(X[D == 0], Y[D == 0])
    cate = bart_t.predict(X) - bart_c.predict(X)

    print("=== BART Causal Results ===")
    print(f"ATE (estimated): {cate.mean():.4f}")
    print(f"True ATE:        {tau_true.mean():.4f}")
    print(f"Correlation:     {np.corrcoef(cate, tau_true)[0,1]:.4f}")
except ImportError:
    print("bartpy not installed in Stata's Python.")
    print("Recommendation: Use R with bartCause for full BART causal inference.")
end
R Output
=== BART Causal Inference Results ===
ATE (posterior mean):   1.0287
True ATE:               1.0238
Cor(estimated, true):   0.8413
95% Credible Interval: [ 0.8742 , 1.1831 ]
Python Output
=== BART Causal Results (Python) ===
ATE (estimated):       1.0341
True ATE:              1.0238
Cor(estimated, true):  0.7896
Stata Output
=== BART Causal Results ===
ATE (estimated): 1.0341
True ATE:        1.0238
Correlation:     0.7896

BART provides good point estimates and, through the posterior distribution, natural uncertainty quantification. The credible intervals from the R implementation account for both estimation uncertainty and model uncertainty. The main advantage of BART over causal forests is its Bayesian uncertainty quantification; the main disadvantage is that BART does not have the same formal asymptotic guarantees as causal forests, and computation can be slower for large datasets.

Choosing a Causal ML Method

DML

  • Estimates average treatment effect
  • Any ML method for nuisance
  • Asymptotically normal, valid CI
  • Best when ATE is the target
  • DoubleML (Python/R), ddml (Stata)

Causal Forests

  • Estimates heterogeneous effects
  • Honest estimation, valid CI
  • Policy learning integration
  • Best for discovering HTEs
  • grf (R), econml (Python)

BART / BCF

  • Bayesian uncertainty (credible intervals)
  • Flexible nonparametric response surface
  • Natural regularization via priors
  • Best for smaller data with rich uncertainty
  • bartCause, bcf (R)

Limitations

ML Does Not Solve the Identification Problem

The most important limitation of causal ML is one that no amount of algorithmic sophistication can overcome: machine learning does not solve the fundamental identification problem. All of the methods in this module — DML, causal forests, BART — rely on the assumption of unconfoundedness (also called "selection on observables" or "conditional ignorability"): conditional on the observed covariates X, treatment assignment is independent of potential outcomes.

This is the same assumption that underlies OLS with controls, propensity score matching, and inverse probability weighting. ML makes these methods work better in high dimensions, but it cannot conjure causal identification where none exists. If there are unobserved confounders — variables that affect both treatment and outcome but are not in your data — all of these methods will produce biased estimates, no matter how many trees or how much cross-fitting you use.
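A small simulation makes the point concrete: when a confounder U is missing from the controls, adjusting on the observed X — however flexibly — cannot remove the bias. A sketch using plain regression adjustment (any ML adjustment on the same X inherits the same bias, since the problem is the data, not the estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
U = rng.normal(size=n)                  # unobserved confounder
X = rng.normal(size=(n, 5))             # observed controls, unrelated to D
D = (0.8 * U + rng.normal(size=n) > 0).astype(float)  # treatment driven by U
Y = 1.0 * D + 2.0 * U + X[:, 0] + rng.normal(size=n)  # true ATE = 1

# Regression adjustment using only the observed X (U is omitted)
Z = np.column_stack([np.ones(n), D, X])
beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(f"True ATE: 1.0 | estimated ATE: {beta[1]:.3f}")  # badly biased upward
```

Swapping the OLS step for DML or a causal forest on the same X changes nothing essential: they estimate the same confounded contrast, only more flexibly.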

The practical implication is that causal ML should be used in combination with a credible identification strategy, not as a substitute for one. If you have a valid instrument, a clean regression discontinuity, or a natural experiment, causal ML can make your estimates more precise and reveal heterogeneity. But if your identification strategy is weak, no ML method will save you.

References

  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.
  • Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.
  • Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353-7360.
  • Athey, S., & Imbens, G. W. (2019). Machine learning methods that economists should know about. Annual Review of Economics, 11, 685-725.
  • Athey, S., & Wager, S. (2021). Policy learning with observational data. Econometrica, 89(1), 133-161.
  • Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217-240.
  • Hahn, P. R., Murray, J. S., & Carvalho, C. M. (2020). Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects. Bayesian Analysis, 15(3), 965-1056.
  • Knaus, M. C., Lechner, M., & Strittmatter, A. (2021). Machine learning estimation of heterogeneous causal effects: Empirical Monte Carlo evidence. The Econometrics Journal, 24(1), 134-161.