11  Machine Learning

~12 hours total · Prediction, Regularization, Trees, Neural Networks, Causal ML

Learning Objectives

  • Distinguish prediction tasks from causal inference tasks
  • Understand the bias-variance tradeoff and why it matters
  • Know when to use regularization, trees, or neural networks
  • Apply ML methods to economics and social science problems
  • Use ML as a tool for causal inference (DML, causal forests)
  • Evaluate and compare models properly

Prerequisites

This module assumes familiarity with linear regression, hypothesis testing, and basic causal inference concepts from Modules 5–7 (Data Analysis, Causal Inference, Estimation Methods). You should be comfortable running regressions and interpreting coefficients in at least one of the three course languages.

The ML Mindset in One Paragraph

Think of traditional econometrics like a tailor who makes custom suits: you specify exactly how the garment should look (the functional form), and it fits one body (your data) precisely. Machine learning is more like a clothing factory that learns patterns from thousands of customers and produces garments that fit new people well, even ones it has never measured. The factory does not care why certain body proportions go together — only that the patterns are predictive. This module teaches you when to use the tailor (causal econometrics) and when to use the factory (ML), and how they can work together.

11.1  Why Machine Learning for Economists?

Economists have traditionally focused on causal inference: understanding why things happen and estimating the effect of specific interventions. This is the domain of the randomized experiment, the instrumental variable, and the difference-in-differences design. But a growing body of research recognizes that many important policy and research questions are fundamentally about prediction, not causation.

Why This Matters

Many real policy decisions are prediction problems in disguise. When a government agency needs to decide which households should receive benefits, the question is not “what is the causal effect of income on poverty?” but rather “which households are most likely to be poor?” When a judge must decide whether to grant bail, the relevant question is “how likely is this person to reoffend?” ML is built for these tasks.

Mullainathan and Spiess (2017) formalized this insight with the concept of prediction policy problems. They argue that whenever the policy-relevant quantity is a prediction — rather than a causal parameter — machine learning methods are the natural tool. Traditional econometric models like OLS are designed to produce unbiased coefficient estimates, not to minimize prediction error. ML methods, by contrast, are explicitly optimized for predictive accuracy. They can capture complex nonlinear relationships, handle hundreds or thousands of predictors, and automatically determine which variables matter most — all without requiring the researcher to specify the functional form in advance.

The range of prediction tasks relevant to economists is broader than you might initially think:

  • Text classification: Is this central bank statement hawkish or dovish?
  • Image recognition: Can satellite night-light imagery predict economic development?
  • Nowcasting: What is GDP this quarter, before official data arrives?
  • Targeting interventions: Which students are at risk of dropping out?
  • Variable selection: Which of 200 potential predictors of GDP growth actually matter?

In each of these cases, the researcher does not need to know why certain features predict the outcome — only that they do.

Crucially, machine learning complements rather than replaces econometric methods. It is not a substitute for a well-identified causal design when your question is about causation. Instead, ML provides the right toolkit for prediction problems and, increasingly, serves as an input to causal estimation itself. In Module 11D, you will see how methods like Double/Debiased Machine Learning (DML) and causal forests use ML predictions as building blocks for valid causal inference. Understanding ML is therefore essential for the modern applied economist — not as a replacement for your existing skills, but as a powerful addition to your methodological toolkit.

11.2  Prediction vs. Causation

Think of it this way: a weather forecast (prediction) tells you whether to bring an umbrella. A cloud-seeding experiment (causal inference) tells you whether humans can make it rain. Both are useful, but they answer fundamentally different questions and require different tools.

The most important conceptual distinction in this module is between prediction and causation. Prediction asks: “Given what I observe about this individual, what is my best guess for their outcome Y?” Causation asks: “If I change X, what happens to Y?” These are fundamentally different questions, and the tools optimized for one are generally not optimal for the other.

Consider a concrete example. Suppose you have data on student characteristics and their exam scores. A prediction question would be: “Given a new student’s GPA, attendance record, and reported study hours, what score will they get on the final exam?” A causal question would be: “If we increase study hours by one hour per week, how much does the exam score improve?” For the prediction question, it does not matter whether GPA causes higher scores or is merely correlated with them — all that matters is that GPA helps us predict the outcome. For the causal question, we need to worry about confounders (perhaps more motivated students both study more and score higher), reverse causality, and identification strategies.
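To make the distinction concrete in code, here is a minimal simulated sketch (all numbers hypothetical): an unobserved motivation variable drives both study hours and exam scores. The regression of scores on study hours predicts well, yet its coefficient lands far from the true causal effect of 1.0 because it absorbs the confounder.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical simulation: motivation confounds study hours and scores.
rng = np.random.default_rng(0)
n = 5000
motivation = rng.normal(size=n)                    # unobserved confounder
study_hours = 2 * motivation + rng.normal(size=n)  # motivated students study more
score = 1.0 * study_hours + 3 * motivation + rng.normal(size=n)

# Regressing score on study_hours alone yields a coefficient well above the
# true causal effect of 1.0 -- yet the model still predicts scores well.
naive = LinearRegression().fit(study_hours.reshape(-1, 1), score)
print(f"Naive coefficient: {naive.coef_[0]:.2f}  (true causal effect: 1.0)")
print(f"R-squared:         {naive.score(study_hours.reshape(-1, 1), score):.2f}")
```

For prediction, this model is perfectly fine; for the causal question, its coefficient is biased upward (here toward roughly 2.2, by the usual omitted-variable formula).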

                    Prediction                                Causation
Goal                Minimize forecast error                   Estimate a causal effect
Method              Any model that predicts well              Requires an identification strategy
Evaluation          Out-of-sample accuracy (RMSE, AUC)        Unbiasedness, consistency
Coefficients        Not individually interpretable            Each has structural meaning
Confounders         Helpful (more predictive power)           Must be controlled for or eliminated
Economics example   “Which firms will default next year?”     “Does raising interest rates reduce defaults?”

In a prediction framework, we do not care about the interpretability of individual coefficients. A model with 500 features and complex interactions can be perfectly valid if it predicts well on new data. What matters is out-of-sample accuracy: how well does the model perform on data it has never seen? This is a fundamentally different criterion than the unbiasedness or consistency that we prize in causal inference.

Athey and Imbens (2019) emphasize that ML methods are explicitly optimized for prediction, not for causal estimation. Using them naively for causal questions — for example, interpreting a random forest’s variable importance as a causal effect — can be deeply misleading. The variable importance might reflect correlation, reverse causation, or confounding rather than a genuine causal relationship. Throughout this module, we will be careful to distinguish when we are using ML for prediction versus when we are using ML-augmented methods that maintain causal validity.

A Basic Prediction Pipeline

Imagine you are studying for an exam with two sets of practice problems. You learn the material using Set A (training), then check whether you truly understand it by testing yourself on Set B (test) — problems you have never seen. If you only checked yourself on Set A, you might mistake memorization for understanding. The train/test split works the same way for models.

Let us set up the fundamental workflow that underlies every ML task: split the data into training and test sets, fit a model on the training data, and evaluate its performance on the test data. This train/test paradigm is the foundation of everything that follows.

The Train/Test Split

[Figure: the full dataset of N observations is partitioned into a training set (80%), which the model learns from, and a test set (20%).]

The model never sees test data during training. This ensures an honest measure of real-world performance.

What the Code Below Does

We simulate a dataset where only 3 out of 10 features actually matter (mimicking a common economics scenario where many candidate predictors are noise). We then split the data 80/20, fit an OLS regression on the training portion, and measure how well it predicts the held-out test data. The key metric is RMSE (root mean squared error) — lower values mean better predictions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate simulated data: 10 features, only first 3 matter
np.random.seed(42)
n = 500
X = np.random.randn(n, 10)
y = 2 * X[:, 0] + 1.5 * X[:, 1] - 1 * X[:, 2] + np.random.randn(n) * 0.5

# Step 1: Split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Fit model on training data only
model = LinearRegression()
model.fit(X_train, y_train)

# Step 3: Predict on test data
y_pred = model.predict(X_test)

# Step 4: Evaluate out-of-sample performance
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Training observations: {len(X_train)}")
print(f"Test observations:     {len(X_test)}")
print(f"Test RMSE:             {rmse:.4f}")
print(f"R-squared (test):      {model.score(X_test, y_test):.4f}")
* Generate simulated data: 10 features, only first 3 matter
clear
set seed 42
set obs 500

* Create 10 features
forvalues i = 1/10 {
    gen x`i' = rnormal()
}

* True relationship: only first 3 features matter
gen y = 2*x1 + 1.5*x2 - 1*x3 + rnormal()*0.5

* Step 1: Create a train/test split indicator
gen split = runiform()
gen test = (split > 0.8)

* Step 2: Fit model on training data only
regress y x1-x10 if test == 0

* Step 3: Predict on all data (including test)
predict yhat

* Step 4: Evaluate on test set only
gen resid_sq = (y - yhat)^2 if test == 1
summarize resid_sq if test == 1
display "Test RMSE: " sqrt(r(mean))
count if test == 0
display "Training obs: " r(N)
count if test == 1
display "Test obs: " r(N)
library(rsample)

# Generate simulated data: 10 features, only first 3 matter
set.seed(42)
n <- 500
X <- matrix(rnorm(n * 10), ncol = 10)
colnames(X) <- paste0("x", 1:10)
y <- 2 * X[, 1] + 1.5 * X[, 2] - 1 * X[, 3] + rnorm(n) * 0.5
df <- data.frame(y = y, X)

# Step 1: Split into training (80%) and test (20%)
split <- initial_split(df, prop = 0.8)
train_data <- training(split)
test_data  <- testing(split)

# Step 2: Fit model on training data
model <- lm(y ~ ., data = train_data)

# Step 3: Predict on test data
y_pred <- predict(model, newdata = test_data)

# Step 4: Evaluate out-of-sample performance
rmse_val <- sqrt(mean((test_data$y - y_pred)^2))
cat("Training observations:", nrow(train_data), "\n")
cat("Test observations:    ", nrow(test_data), "\n")
cat("Test RMSE:            ", round(rmse_val, 4), "\n")
r2 <- 1 - sum((test_data$y - y_pred)^2) / sum((test_data$y - mean(test_data$y))^2)
cat("R-squared (test):     ", round(r2, 4), "\n")
Python Output
Training observations: 400
Test observations:     100
Test RMSE:             0.5182
R-squared (test):      0.9571
Stata Output
      Source |       SS           df       MS
-------------+----------------------------------
       Model |  2348.12741        10  234.812741
    Residual |   96.5488209      390  .247561079
-------------+----------------------------------
       Total |  2444.67623       400  6.11169058

Test RMSE: .50943
Training obs: 401
Test obs: 99
R Output
Training observations: 400
Test observations:     100
Test RMSE:             0.5237
R-squared (test):      0.9548

Syntax Breakdown: The Train/Test Split

Python: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) — tuple unpacking captures four arrays at once. test_size is the fraction reserved for testing (0.2 = 20%).

R: split <- initial_split(df, prop=0.8) creates a split object; extract with training(split) and testing(split). The prop parameter sets the training fraction.

Stata: No built-in splitter; generate a random uniform variable and threshold it: gen test = (runiform() > 0.8). Then use if test == 0 / if test == 1 to separate train and test.

11.3  The Bias-Variance Tradeoff

Think of overfitting like memorizing the answers to a practice test rather than learning the material. You will ace that specific practice test, but you will bomb the real exam because you never understood the underlying concepts. A model that memorizes training data (overfitting) fails on new data for the same reason.

The bias-variance tradeoff is arguably the single most important concept in machine learning. It explains why simple models sometimes outperform complex ones, why fitting your training data perfectly is usually a mistake, and how to think about the fundamental tension between underfitting and overfitting. Every ML method you will learn in this module — from LASSO to random forests to neural networks — can be understood as a particular way of navigating this tradeoff.

Why This Matters for Economists

Economists are trained to avoid omitted variable bias (underfitting), but ML teaches an equally important lesson: including too many variables or too much flexibility can increase prediction error (overfitting). When you have 200 potential predictors of GDP growth, throwing them all into OLS may actually predict worse than a simpler model. LASSO, trees, and neural networks are all methods for finding the sweet spot.

The key insight is that the expected prediction error for any model can be decomposed into three components:

E[(Y − f̂(X))²] = Bias²(f̂) + Var(f̂) + σ²

Bias measures how far off our model’s average prediction is from the truth. A model with high bias makes systematically wrong predictions because it is too simple to capture the true relationship. For instance, fitting a straight line to data that follows a curve produces high bias — the model simply cannot represent the underlying pattern, no matter how much data we give it. This is underfitting.

Variance measures how much our model’s predictions fluctuate across different training samples. A model with high variance is highly sensitive to the particular data points it was trained on. If you were to re-train the model on a slightly different random sample, the predictions would change dramatically. A 20th-degree polynomial that wiggles through every training point has high variance — it has essentially memorized the noise in that particular sample. This is overfitting.

The third term, σ², is the irreducible error — the inherent randomness in the data that no model can eliminate. Even with a perfect model and infinite data, you cannot predict an outcome perfectly if there is genuine noise in the system. This term sets a floor on how well any model can perform.

The fundamental tradeoff is this: as model complexity increases, bias decreases (the model can capture more complex patterns) but variance increases (the model becomes more sensitive to the training data). The optimal model is the one that minimizes the total error — the sweet spot where the sum of bias-squared and variance is at its minimum. Every ML method has one or more tuning parameters (hyperparameters) that control this tradeoff: the penalty parameter λ in LASSO, the maximum depth in decision trees, the number of hidden layers in a neural network.

The Bias-Variance Tradeoff

[Figure: prediction error vs. model complexity. Bias² falls and variance rises as complexity grows, while the irreducible error σ² stays flat; total error is U-shaped, with a sweet spot between the underfitting and overfitting regions.]

As complexity increases, bias falls but variance rises. The best model minimizes their sum.

Demonstrating Overfitting with Polynomial Fits

What the Code Below Does

We generate data from a noisy sine curve, then fit polynomials of degree 1 (straight line), 5 (smooth curve), and 15 (wiggly curve). Watch what happens: training error always drops as the model gets more complex, but test error rises sharply for degree 15. That divergence is overfitting. In economics terms, it is like building a model that perfectly explains historical GDP data but gives terrible forecasts.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

np.random.seed(42)

# Generate noisy sine data
n_train, n_test = 30, 200
X_train = np.sort(np.random.uniform(0, 2 * np.pi, n_train)).reshape(-1, 1)
X_test  = np.linspace(0, 2 * np.pi, n_test).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + np.random.randn(n_train) * 0.3
y_test  = np.sin(X_test).ravel() + np.random.randn(n_test) * 0.3

# Fit polynomials of degree 1, 5, and 15
print(f"{'Degree':<10} {'Train MSE':<15} {'Test MSE':<15}")
print("-" * 40)

for degree in [1, 5, 15]:
    poly = PolynomialFeatures(degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly  = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    train_mse = mean_squared_error(y_train, model.predict(X_train_poly))
    test_mse  = mean_squared_error(y_test,  model.predict(X_test_poly))

    print(f"{degree:<10} {train_mse:<15.4f} {test_mse:<15.4f}")
* Generate noisy sine data
clear
set seed 42
set obs 230

* First 30 obs = training, rest = test
gen id = _n
gen test = (id > 30)
gen x = runiform() * 2 * _pi if test == 0
replace x = (id - 30) / 200 * 2 * _pi if test == 1
gen y = sin(x) + rnormal() * 0.3

* Generate polynomial terms (x^2 through x^15)
forvalues p = 2/15 {
    gen x`p' = x^`p'
}

* Fit polynomial regressions of increasing degree
display "Degree     Train MSE      Test MSE"
display "----------------------------------------"

* Degree 1
quietly regress y x if test == 0
predict yhat1
gen se1_train = (y - yhat1)^2 if test == 0
gen se1_test  = (y - yhat1)^2 if test == 1
quietly summarize se1_train
local tr1 = r(mean)
quietly summarize se1_test
display "1          " %9.4f `tr1' "      " %9.4f r(mean)

* Degree 5
quietly regress y x x2-x5 if test == 0
predict yhat5
gen se5_train = (y - yhat5)^2 if test == 0
gen se5_test  = (y - yhat5)^2 if test == 1
quietly summarize se5_train
local tr5 = r(mean)
quietly summarize se5_test
display "5          " %9.4f `tr5' "      " %9.4f r(mean)

* Degree 15
quietly regress y x x2-x15 if test == 0
predict yhat15
gen se15_train = (y - yhat15)^2 if test == 0
gen se15_test  = (y - yhat15)^2 if test == 1
quietly summarize se15_train
local tr15 = r(mean)
quietly summarize se15_test
display "15         " %9.4f `tr15' "      " %9.4f r(mean)
set.seed(42)

# Generate noisy sine data
n_train <- 30; n_test <- 200
x_train <- sort(runif(n_train, 0, 2 * pi))
x_test  <- seq(0, 2 * pi, length.out = n_test)
y_train <- sin(x_train) + rnorm(n_train, sd = 0.3)
y_test  <- sin(x_test) + rnorm(n_test, sd = 0.3)

# Fit polynomials of degree 1, 5, and 15
degrees <- c(1, 5, 15)
results <- data.frame(Degree = integer(), Train_MSE = numeric(), Test_MSE = numeric())

for (d in degrees) {
  # Fit polynomial regression
  model <- lm(y_train ~ poly(x_train, d, raw = TRUE))

  # Predict on train and test
  pred_train <- predict(model)
  pred_test  <- predict(model, newdata = data.frame(x_train = x_test))

  # Compute MSE
  train_mse <- mean((y_train - pred_train)^2)
  test_mse  <- mean((y_test - pred_test)^2)

  results <- rbind(results, data.frame(Degree = d, Train_MSE = train_mse, Test_MSE = test_mse))
}

print(results, row.names = FALSE)
Python Output
Degree     Train MSE       Test MSE
----------------------------------------
1          0.2194          0.2487
5          0.0621          0.0845
15         0.0298          0.5731
Stata Output
Degree     Train MSE      Test MSE
----------------------------------------
1             0.2237         0.2512
5             0.0654         0.0891
15            0.0312         0.5843
R Output
 Degree  Train_MSE  Test_MSE
      1    0.21506   0.24632
      5    0.06348   0.08714
     15    0.03012   0.56928

Notice the pattern: training MSE always decreases as the polynomial degree increases (the model fits the training data better and better). But test MSE first decreases (from degree 1 to degree 5) and then increases sharply (from degree 5 to degree 15). The degree-15 polynomial has essentially memorized the 30 training points, including their noise, and makes wild predictions on new data. This is overfitting in action, and it is precisely the bias-variance tradeoff at work.

11.4  Supervised vs. Unsupervised Learning

Supervised learning is like a teacher grading practice essays — the student (model) knows what the right answer looks like and adjusts accordingly. Unsupervised learning is like sorting a pile of unlabeled photographs into groups — there is no “correct” answer, just patterns to discover.

Machine learning methods are broadly divided into two families: supervised learning and unsupervised learning. The distinction is straightforward: in supervised learning, we have a labeled outcome variable Y that we want to predict from features X. In unsupervised learning, there is no outcome variable — we are simply looking for patterns and structure in the data. The name “supervised” comes from the idea that the labels “supervise” the learning process by telling the algorithm what the right answer is.

Within supervised learning, there are two main subtypes. Regression problems have a continuous outcome (predicting house prices, GDP growth, or exam scores). Classification problems have a categorical outcome (classifying emails as spam or not, predicting whether a loan will default, or identifying the sentiment of a text). The same algorithms (like decision trees or neural networks) can often handle both types, though the loss function and evaluation metrics differ. For regression, we typically use RMSE or MAE; for classification, accuracy, precision, recall, or AUC.

Accuracy is misleading with imbalanced classes

When one class dominates the data (e.g., 95% of loans do not default), a model that always predicts the majority class achieves 95% accuracy while being completely useless. For imbalanced classification, use precision (of those predicted positive, how many truly are), recall (of actual positives, how many were caught), or AUC (how well the model discriminates across all thresholds). See Module 11E: Model Evaluation for details.
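A quick sketch of the warning above, with hypothetical loan data (a 5% default rate) and a trivial always-predict-no-default "model":

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical 5% default rate: a model that always predicts "no default"
# scores ~95% accuracy yet catches none of the defaults we care about.
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)   # ~5% defaults
y_pred = np.zeros(1000, dtype=int)               # always predict majority class

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
```

Accuracy looks impressive while precision and recall are both zero, which is exactly why imbalanced problems need the alternative metrics.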

Unsupervised learning is used when we want to discover structure without a specific prediction target. Clustering methods (like k-means) group similar observations together — for example, segmenting countries by economic characteristics or grouping consumers by purchasing behavior. Dimensionality reduction methods (like PCA) compress many variables into a smaller number of components that capture the most important variation. For economists, most ML applications in research are supervised — predicting outcomes, classifying text, or using ML predictions as inputs to causal estimation. However, unsupervised methods appear in text analysis (topic models), factor models, and exploratory data analysis.
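As a small illustration of both unsupervised tools, the sketch below clusters simulated “countries” with k-means and then compresses the same features with PCA (all indicator values are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 50 "countries" with three economic indicators
# (GDP per capita, growth rate, life expectancy), two latent groups.
rng = np.random.default_rng(42)
rich = rng.normal([40, 2, 80], [5, 1, 3], size=(25, 3))
poor = rng.normal([5, 5, 60], [2, 2, 5], size=(25, 3))
X = StandardScaler().fit_transform(np.vstack([rich, poor]))  # scale first

# K-means: group similar observations with no outcome variable
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster sizes:", np.bincount(labels))

# PCA: compress three indicators into two components
pca = PCA(n_components=2).fit(X)
print("Variance explained by 2 components:",
      pca.explained_variance_ratio_.sum().round(3))
```

With groups this well separated, k-means should recover the two latent groups, and two principal components capture most of the variation in the three indicators.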

                  Supervised                                 Unsupervised
Goal              Predict Y from X                           Find structure in X
Data              Labeled (X, Y)                             Unlabeled (X only)
Examples          LASSO, Random Forest, Neural Nets          K-means, PCA, Topic Models
Evaluation        Compare predictions to known labels        Internal metrics (silhouette, variance explained)
Economics use     Prediction policy, causal ML, targeting    Text clustering, factor models, country grouping

11.5  Cross-Validation & Model Selection

A single train/test split is like evaluating a chef based on one dish — maybe they got lucky, or maybe the judge was lenient. Cross-validation is like having the chef cook five different meals for five different panels of judges. The average score across all five panels gives a much more reliable assessment of their skill.

We have established that out-of-sample performance is what matters for prediction. But a single train/test split has two problems. First, it wastes data — the observations in the test set are never used for training. Second, the results depend on the particular random split: you might get a favorable split where the test set happens to be easy, or an unfavorable one where it is unusually hard. K-fold cross-validation solves both problems by systematically rotating which data is used for training and testing.

Why This Matters

Cross-validation is how you choose hyperparameters — the settings that control your model's complexity. For LASSO, it is the penalty λ. For random forests, it is the number of trees and maximum depth. For neural networks, it is the architecture and learning rate. Without CV, you are guessing these settings. With CV, you let the data tell you the optimal configuration.

The procedure works as follows. Divide your data into K equally sized “folds” (groups). For each fold k = 1, ..., K: train the model on all data except fold k, then evaluate it on fold k. After all K iterations, every observation has been used exactly once as a test point. The average of the K test errors gives you a robust estimate of out-of-sample performance. Typical choices are K = 5 or K = 10. The extreme case K = N (called leave-one-out cross-validation, or LOOCV) trains the model N times, each time leaving out a single observation. LOOCV has the lowest bias but can be computationally expensive and sometimes has higher variance than K = 10.
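The K = N extreme is easy to run with scikit-learn's LeaveOneOut splitter. This sketch reuses the simulated design from Section 11.2, with a smaller n to keep the N separate model fits cheap:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# LOOCV: fit the model n times, each time holding out one observation.
np.random.seed(42)
n = 100                                 # kept small: LOOCV requires n fits
X = np.random.randn(n, 10)
y = 2 * X[:, 0] + 1.5 * X[:, 1] - 1 * X[:, 2] + np.random.randn(n) * 0.5

scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring='neg_mean_squared_error')
print(f"LOOCV RMSE over {len(scores)} fits: {np.sqrt(-scores.mean()):.4f}")
```

For a linear model on 100 observations this is instant, but with slow learners or large n the n refits are exactly the computational cost the text warns about.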

5-Fold Cross-Validation

[Figure: five rows, one per fold. In fold k, the k-th block of the data serves as the test set and the other four blocks as the training set, yielding RMSE_k. CV Score = mean(RMSE_1, ..., RMSE_5).]

Each observation is used as a test point exactly once. The final CV score is the average of all 5 test errors.

An important practical point: always do your cross-validation before looking at the test set. The test set should be touched only once, at the very end, to report your model’s final performance. If you repeatedly evaluate on the test set and adjust your model, you are effectively training on the test set, and your reported performance will be overly optimistic. Some practitioners advocate for a three-way split: training set (for fitting), validation set (for hyperparameter tuning), and test set (for final evaluation). Cross-validation on the training set replaces the need for a separate validation set.

What the Code Below Does

We perform 5-fold cross-validation on the same simulated dataset. Instead of a single RMSE number, we now get five estimates (one per fold). The mean and standard deviation of these five numbers tell us both how well the model performs and how stable that estimate is. A low standard deviation means the performance is consistent regardless of which observations happen to be in the test fold.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Use the same simulated data
np.random.seed(42)
n = 500
X = np.random.randn(n, 10)
y = 2 * X[:, 0] + 1.5 * X[:, 1] - 1 * X[:, 2] + np.random.randn(n) * 0.5

# Set up 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LinearRegression()

# cross_val_score returns negative MSE (sklearn convention: higher = better)
scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')

# Convert to positive RMSE
rmse_scores = np.sqrt(-scores)

print("RMSE for each fold:")
for i, score in enumerate(rmse_scores, 1):
    print(f"  Fold {i}: {score:.4f}")
print(f"\nMean RMSE:  {rmse_scores.mean():.4f}")
print(f"Std RMSE:   {rmse_scores.std():.4f}")
* Manual 5-fold cross-validation in Stata
clear
set seed 42
set obs 500

* Generate data
forvalues i = 1/10 {
    gen x`i' = rnormal()
}
gen y = 2*x1 + 1.5*x2 - 1*x3 + rnormal()*0.5

* Assign observations to 5 folds randomly
gen u = runiform()
sort u
gen fold = mod(_n - 1, 5) + 1

* Perform 5-fold CV
gen cv_resid_sq = .

forvalues k = 1/5 {
    quietly regress y x1-x10 if fold != `k'
    predict yhat_temp if fold == `k'
    replace cv_resid_sq = (y - yhat_temp)^2 if fold == `k'
    drop yhat_temp
}

* Report RMSE per fold and overall
display "RMSE for each fold:"
forvalues k = 1/5 {
    quietly summarize cv_resid_sq if fold == `k'
    display "  Fold `k': " %7.4f sqrt(r(mean))
}
quietly summarize cv_resid_sq
display _newline "Mean RMSE: " %7.4f sqrt(r(mean))
library(rsample)

# Generate data
set.seed(42)
n <- 500
X <- matrix(rnorm(n * 10), ncol = 10)
colnames(X) <- paste0("x", 1:10)
y <- 2 * X[, 1] + 1.5 * X[, 2] - 1 * X[, 3] + rnorm(n) * 0.5
df <- data.frame(y = y, X)

# Create 5-fold CV splits
folds <- vfold_cv(df, v = 5)

# Compute RMSE for each fold
rmse_vals <- sapply(1:5, function(i) {
  train_data <- analysis(folds$splits[[i]])
  test_data  <- assessment(folds$splits[[i]])
  model <- lm(y ~ ., data = train_data)
  preds <- predict(model, newdata = test_data)
  sqrt(mean((test_data$y - preds)^2))
})

cat("RMSE for each fold:\n")
for (i in 1:5) {
  cat(sprintf("  Fold %d: %.4f\n", i, rmse_vals[i]))
}
cat(sprintf("\nMean RMSE:  %.4f\n", mean(rmse_vals)))
cat(sprintf("Std RMSE:   %.4f\n", sd(rmse_vals)))
Python Output
RMSE for each fold:
  Fold 1: 0.4973
  Fold 2: 0.5241
  Fold 3: 0.5108
  Fold 4: 0.4856
  Fold 5: 0.5322

Mean RMSE:  0.5100
Std RMSE:   0.0173
Stata Output
RMSE for each fold:
  Fold 1:  0.5012
  Fold 2:  0.5198
  Fold 3:  0.5067
  Fold 4:  0.4921
  Fold 5:  0.5284

Mean RMSE:  0.5096
R Output
RMSE for each fold:
  Fold 1: 0.5031
  Fold 2: 0.5187
  Fold 3: 0.5095
  Fold 4: 0.4889
  Fold 5: 0.5299

Mean RMSE:  0.5100
Std RMSE:   0.0139

Syntax Breakdown: Cross-Validation

Python: cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error') — returns an array of 5 scores. The neg_ prefix is a scikit-learn convention (it maximizes scores, so MSE is negated). Take np.sqrt(-scores) for RMSE.

R: vfold_cv(df, v=5) creates the folds; then loop over analysis() and assessment() to extract train/test for each fold. No single built-in function like Python's cross_val_score, but tidymodels' fit_resamples() automates this in more advanced workflows.

Stata: No built-in CV function. Create fold assignments with gen fold = mod(_n - 1, 5) + 1 after random sorting, then loop with forvalues k = 1/5, fitting on if fold != `k' and predicting on if fold == `k'.
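To see the hyperparameter-tuning use of CV mentioned earlier, the sketch below (same simulated design: only 3 of 10 features matter) lets scikit-learn's LassoCV choose the LASSO penalty by 5-fold cross-validation; LASSO itself is covered in Module 11A:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulated data: 10 candidate features, only the first 3 matter.
np.random.seed(42)
n = 500
X = np.random.randn(n, 10)
y = 2 * X[:, 0] + 1.5 * X[:, 1] - 1 * X[:, 2] + np.random.randn(n) * 0.5

# LassoCV fits a grid of penalty values (alpha in sklearn, lambda in the
# text) and keeps the one with the best 5-fold CV error.
lasso = LassoCV(cv=5, random_state=42).fit(X, y)
print(f"CV-chosen alpha: {lasso.alpha_:.4f}")
print("Nonzero coefficients at indices:",
      np.flatnonzero(np.abs(lasso.coef_) > 1e-6))
```

The CV-selected penalty should keep the three true predictors in the model; this is the data-driven alternative to guessing λ by hand.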

11.6  Module Roadmap

This module is organized into five subpages, each covering a major family of ML methods. Work through them in order, as later sections build on concepts from earlier ones.

11A: Regularization

Ridge, LASSO, Elastic Net, Post-LASSO for causal inference

~2.5 hours
11B: Tree-Based Methods

Decision Trees, Random Forests, Gradient Boosting (XGBoost)

~2.5 hours
11C: Neural Networks

Perceptrons, backpropagation, deep learning fundamentals

~2.5 hours
11D: Causal ML

Double/Debiased ML, Causal Forests, heterogeneous treatment effects

~2.5 hours
11E: Model Evaluation

Metrics, confusion matrices, ROC curves, calibration, comparing models

~2 hours

Which Method Should I Use?

The decision depends on your data, your question, and the tradeoff between interpretability and predictive power. Here is a simplified guide:

[Decision flowchart]

  • What is your goal?
      – Prediction → What kind of data?
          Tabular → Need interpretability? Yes: LASSO (11A). No: XGBoost / RF (11B).
          Text/Image → Neural Nets (11C)
      – Causal effect → Need heterogeneity?
          No (ATE) → DML (11D)
          Yes (HTE) → Causal Forest (11D)
  • In every case, finish with Model Evaluation (11E)!

ATE = Average Treatment Effect | HTE = Heterogeneous Treatment Effects | RF = Random Forest

Comparing the Major Method Families

Regularization (11A)

  • Linear models with penalty terms
  • Fast, interpretable, well-understood theory
  • Best when relationship is approximately linear
  • LASSO provides automatic variable selection

Economics use: LASSO helps when you have 200 potential predictors of GDP growth and need to find the 10 that matter.

Trees (11B)

  • Nonparametric, capture interactions naturally
  • Random forests and boosting are very powerful
  • No need to specify functional form
  • Often best “out-of-the-box” performance

Economics use: Predicting loan defaults when the risk depends on complex interactions between income, debt, and employment.

Neural Networks (11C)

  • Universal function approximators
  • Excel with unstructured data (text, images)
  • Require large datasets and careful tuning
  • Less common in applied economics (so far)

Economics use: Classifying satellite imagery to measure economic activity, or processing central bank texts for sentiment.

References

  • Mullainathan, S. & Spiess, J. (2017). “Machine Learning: An Applied Econometric Approach.” Journal of Economic Perspectives, 31(2), 87–106.
  • Athey, S. & Imbens, G. (2019). “Machine Learning Methods That Economists Should Know About.” Annual Review of Economics, 11, 685–725.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. 2nd ed. Springer.