10  Machine Learning

~10 hours · Prediction & Classification · Intermediate

Learning Objectives

  • Distinguish prediction from causal inference
  • Understand the bias-variance tradeoff
  • Implement supervised learning algorithms
  • Evaluate models properly with cross-validation
  • Apply ML to economics and social science problems

Prerequisites

This module assumes familiarity with causal inference concepts (Module 5) and estimation methods (Module 6). Understanding the difference between prediction and causation is central to this material.

10.1 Prediction vs. Causation

Machine learning optimizes for prediction: given inputs X, predict output Y as accurately as possible. This is fundamentally different from causal inference, which asks: what happens to Y if we change X?

Aspect       | Prediction (ML)                 | Causal Inference
Goal         | Minimize prediction error       | Estimate treatment effect
Question     | "What is Y given X?"            | "What happens if we change X?"
Coefficients | Unimportant if prediction works | The whole point
Confounders  | Include if predictive           | Must be addressed carefully
Overfitting  | Primary concern                 | Less central

The Prediction-Causation Gap

A model that perfectly predicts hospital deaths might find "being on a ventilator" is associated with dying. Should we remove ventilators? Of course not--the relationship isn't causal. ML finds correlations; interpreting them causally requires additional assumptions.

10.2 The Bias-Variance Tradeoff

Expected prediction error (under squared loss) can be decomposed into three components:

Error = Bias^2 + Variance + Irreducible Noise
  • Bias: Error from wrong assumptions (underfitting)
  • Variance: Error from sensitivity to training data (overfitting)
  • Irreducible: Noise inherent in the data

         | Simple Model           | Complex Model
Bias     | High (misses patterns) | Low (captures patterns)
Variance | Low (stable)           | High (sensitive to data)
Risk     | Underfitting           | Overfitting
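
To make the tradeoff concrete, the sketch below fits a degree-1 and a degree-15 polynomial to the same noisy synthetic sample and compares training and test error (the data and exact numbers are illustrative, not from the module's dataset). The simple model underfits; the flexible one tracks the training points closely but generalizes worse.

# Python: illustrating underfitting vs. overfitting on synthetic data
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)   # signal + irreducible noise

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=42)

for degree in (1, 15):
    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_model.fit(x_tr, y_tr)
    train_mse = mean_squared_error(y_tr, poly_model.predict(x_tr))
    test_mse = mean_squared_error(y_te, poly_model.predict(x_te))
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")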

10.3 Supervised Learning

Supervised learning uses labeled examples: given (X, Y) pairs, the goal is to learn a function f such that f(X) approximates Y.

Linear and Logistic Regression

# Python: Supervised learning with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score
import pandas as pd

# Load and prepare data
df = pd.read_csv("data.csv")
X = df[['age', 'education', 'experience']]
y = df['income']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit linear regression
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # square root of MSE gives RMSE
print(f"RMSE: {rmse:.2f}")

# For classification, use LogisticRegression on a binary outcome
# (here: an indicator for above-median income)
y_binary = (df['income'] > df['income'].median()).astype(int)
X_train, X_test, y_train_binary, y_test_binary = train_test_split(
    X, y_binary, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train_binary)
accuracy = accuracy_score(y_test_binary, clf.predict(X_test))
print(f"Accuracy: {accuracy:.3f}")

# R: Supervised learning with tidymodels
library(tidymodels)
library(readr)  # read_csv() is not attached by tidymodels

# Load data
df <- read_csv("data.csv")

# Split data
set.seed(42)
split <- initial_split(df, prop = 0.8)
train <- training(split)
test <- testing(split)

# Define model
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Fit model
lm_fit <- lm_spec %>%
  fit(income ~ age + education + experience, data = train)

# Predict and evaluate
predictions <- predict(lm_fit, test)
results <- test %>% bind_cols(predictions)

rmse_result <- results %>%
  rmse(truth = income, estimate = .pred)
print(rmse_result)

10.4 Model Evaluation

Train-Test Split

Never evaluate on training data! The model memorizes training examples, so training accuracy is misleading. Always hold out a test set that the model never sees during training.
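
To see why, a small sketch (reusing X_train, X_test, y_train, y_test from Section 10.3) fits an unrestricted decision tree, which can essentially memorize the training data, and compares training and test error:

# Python: training error can be deceptively low
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

deep_tree = DecisionTreeRegressor(random_state=42)   # no depth limit: can memorize
deep_tree.fit(X_train, y_train)

train_rmse = mean_squared_error(y_train, deep_tree.predict(X_train)) ** 0.5
test_rmse = mean_squared_error(y_test, deep_tree.predict(X_test)) ** 0.5
print(f"Train RMSE: {train_rmse:.2f}   Test RMSE: {test_rmse:.2f}")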

Cross-Validation

K-fold cross-validation provides more robust estimates by training K models, each holding out a different fold.

# Python: K-fold cross-validation
from sklearn.model_selection import cross_val_score, KFold

# 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
rmse_scores = (-scores) ** 0.5  # per-fold RMSE from the negative MSE scores

print(f"CV RMSE: {rmse_scores.mean():.2f} (+/- {rmse_scores.std():.2f})")

Metrics

Task           | Metric    | Description
Regression     | RMSE      | Root Mean Squared Error
               | MAE       | Mean Absolute Error
               | R^2       | Variance explained
Classification | Accuracy  | Percent correct
               | Precision | TP / (TP + FP)
               | Recall    | TP / (TP + FN)
               | AUC-ROC   | Area under ROC curve
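
scikit-learn provides functions for each of these metrics. A short sketch, assuming the logistic regression classifier clf and the binary labels y_test_binary from Section 10.3 are still in memory:

# Python: classification metrics with scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

y_pred_class = clf.predict(X_test)              # hard 0/1 predictions
y_pred_prob = clf.predict_proba(X_test)[:, 1]   # predicted probability of class 1

print(f"Accuracy:  {accuracy_score(y_test_binary, y_pred_class):.3f}")
print(f"Precision: {precision_score(y_test_binary, y_pred_class):.3f}")
print(f"Recall:    {recall_score(y_test_binary, y_pred_class):.3f}")
print(f"AUC-ROC:   {roc_auc_score(y_test_binary, y_pred_prob):.3f}")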

10.5 Regularization

Regularization prevents overfitting by penalizing model complexity.

Method      | Penalty                       | Effect
Ridge (L2)  | lambda * sum(beta^2)          | Shrinks coefficients toward zero
LASSO (L1)  | lambda * sum(|beta|)          | Sets some coefficients exactly to zero (variable selection)
Elastic Net | alpha * L1 + (1 - alpha) * L2 | Combines both

# Python: LASSO with cross-validated lambda
from sklearn.linear_model import LassoCV

# LASSO with automatic lambda selection
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_train, y_train)

print(f"Best lambda: {lasso.alpha_:.4f}")
print(f"Non-zero coefficients: {(lasso.coef_ != 0).sum()}")

10.6 Tree-Based Methods

Decision trees partition the feature space into regions, predicting the mean (regression) or mode (classification) in each region.
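
For intuition, the sketch below fits a shallow tree to the training data from Section 10.3 (assumed available) and prints the learned splits as nested if-else rules:

# Python: a shallow regression tree and its splitting rules
from sklearn.tree import DecisionTreeRegressor, export_text

shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=42)
shallow_tree.fit(X_train, y_train)

# Each leaf is a region of the feature space; its prediction is the mean outcome in that region
print(export_text(shallow_tree, feature_names=list(X_train.columns)))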

Random Forests

Random forests average many trees, each trained on a bootstrap sample with random feature subsets. This reduces variance dramatically.

Gradient Boosting

Gradient boosting builds trees sequentially, with each tree correcting the errors of previous trees. XGBoost and LightGBM are popular implementations.

# Python: Random Forest and XGBoost
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

# Random Forest
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

# Feature importance
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# XGBoost
xgb_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5)
xgb_model.fit(X_train, y_train)

10.7 Neural Networks

Neural networks learn hierarchical representations through layers of nonlinear transformations. Deep learning refers to networks with many layers.

# Python: Neural network with PyTorch
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x

# Create model
model = SimpleNN(input_size=10, hidden_size=64, output_size=1)

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
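
The code above only defines the network, loss, and optimizer; it still needs a training loop. A minimal full-batch sketch, where X_np (an n x 10 feature array) and y_np (a length-n outcome array) are hypothetical NumPy arrays standing in for your data:

# Python: minimal training loop for the network above
# X_np and y_np are placeholder NumPy arrays (feature count must match input_size=10)
X_tensor = torch.tensor(X_np, dtype=torch.float32)
y_tensor = torch.tensor(y_np, dtype=torch.float32).reshape(-1, 1)

for epoch in range(100):
    optimizer.zero_grad()                          # clear gradients from the previous step
    loss = criterion(model(X_tensor), y_tensor)    # forward pass and MSE loss
    loss.backward()                                # backpropagation
    optimizer.step()                               # gradient update
    if epoch % 20 == 0:
        print(f"epoch {epoch}: loss = {loss.item():.4f}")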

10.8 ML for Causal Inference

A growing field combines ML's predictive power with causal inference frameworks. Key approaches:

Method             | Use Case
Double ML (DML)    | Use ML to partial out confounders, then estimate treatment effect
Causal Forests     | Estimate heterogeneous treatment effects
LASSO for Controls | Variable selection for control variables
Synthetic Control  | ML-weighted comparison units
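
To give a flavor of the Double ML idea (a simplified partialling-out version, not the full cross-fitting procedure of Chernozhukov et al.), the sketch below uses out-of-fold random forest predictions to strip the confounders X_controls out of both the outcome Y and the treatment D, then regresses residual on residual. The variable names X_controls, D, and Y are illustrative placeholders.

# Python: a simplified Double ML / partialling-out sketch
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
import statsmodels.api as sm

# X_controls, D (treatment), Y (outcome) are assumed NumPy arrays from your data
# Out-of-fold predictions play the role of cross-fitting
Y_hat = cross_val_predict(RandomForestRegressor(random_state=0), X_controls, Y, cv=5)
D_hat = cross_val_predict(RandomForestRegressor(random_state=0), X_controls, D, cv=5)

Y_res = Y - Y_hat   # outcome with confounders partialled out
D_res = D - D_hat   # treatment with confounders partialled out

# Residual-on-residual regression recovers the treatment effect
ols = sm.OLS(Y_res, sm.add_constant(D_res)).fit()
print(f"Estimated treatment effect: {ols.params[1]:.3f} (SE {ols.bse[1]:.3f})")
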
Key References
  • Athey, S. & Imbens, G. (2019). Machine Learning Methods That Economists Should Know About. Annual Review of Economics.
  • Chernozhukov, V., et al. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. Econometrics Journal.
  • Wager, S. & Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. JASA.

10.9 Exercises

Exercise 10.1: Machine Learning Workflow

Practice a basic supervised learning workflow. Complete these tasks:

  1. Split data into training and test sets
  2. Fit a model (linear regression or random forest)
  3. Evaluate model performance (RMSE or accuracy)