11 Machine Learning
Learning Objectives
- Distinguish prediction tasks from causal inference tasks
- Understand the bias-variance tradeoff and why it matters
- Know when to use regularization, trees, or neural networks
- Apply ML methods to economics and social science problems
- Use ML as a tool for causal inference (DML, causal forests)
- Evaluate and compare models properly
Prerequisites
This module assumes familiarity with linear regression, hypothesis testing, and basic causal inference concepts from Modules 5–7 (Data Analysis, Causal Inference, Estimation Methods). You should be comfortable running regressions and interpreting coefficients in at least one of the three course languages.
11.1 Why Machine Learning for Economists?
Economists have traditionally focused on causal inference: understanding why things happen and estimating the effect of specific interventions. This is the domain of the randomized experiment, the instrumental variable, and the difference-in-differences design. But a growing body of research recognizes that many important policy and research questions are fundamentally about prediction, not causation. When a government agency needs to decide which households should receive benefits, the question is not “what is the causal effect of income on poverty?” but rather “which households are most likely to be poor?” When a judge must decide whether to grant bail, the relevant question is “how likely is this person to reoffend?” — a prediction task.
Mullainathan and Spiess (2017) formalized this insight with the concept of prediction policy problems. They argue that whenever the policy-relevant quantity is a prediction — rather than a causal parameter — machine learning methods are the natural tool. Traditional econometric models like OLS are designed to produce unbiased coefficient estimates, not to minimize prediction error. ML methods, by contrast, are explicitly optimized for predictive accuracy. They can capture complex nonlinear relationships, handle hundreds or thousands of predictors, and automatically determine which variables matter most — all without requiring the researcher to specify the functional form in advance.
The range of prediction tasks relevant to economists is broader than you might initially think. Text classification (is this central bank statement hawkish or dovish?), image recognition (can satellite imagery predict economic development?), nowcasting economic indicators (what is GDP this quarter, before official data arrives?), and targeting interventions (which students are at risk of dropping out?) are all examples where ML outperforms traditional approaches. Even in the medical sciences, ML models for diagnosis and risk prediction are proving transformative. In each of these cases, the researcher does not need to know why certain features predict the outcome — only that they do.
Crucially, machine learning complements rather than replaces econometric methods. It is not a substitute for a well-identified causal design when your question is about causation. Instead, ML provides the right toolkit for prediction problems and, increasingly, serves as an input to causal estimation itself. In Module 11D, you will see how methods like Double/Debiased Machine Learning (DML) and causal forests use ML predictions as building blocks for valid causal inference. Understanding ML is therefore essential for the modern applied economist — not as a replacement for your existing skills, but as a powerful addition to your methodological toolkit.
11.2 Prediction vs. Causation
The most important conceptual distinction in this module is between prediction and causation. Prediction asks: “Given what I observe about this individual, what is my best guess for their outcome Y?” Causation asks: “If I change X, what happens to Y?” These are fundamentally different questions, and the tools optimized for one are generally not optimal for the other.
Consider a concrete example. Suppose you have data on student characteristics and their exam scores. A prediction question would be: “Given a new student’s GPA, attendance record, and reported study hours, what score will they get on the final exam?” A causal question would be: “If we increase study hours by one hour per week, how much does the exam score improve?” For the prediction question, it does not matter whether GPA causes higher scores or is merely correlated with them — all that matters is that GPA helps us predict the outcome. For the causal question, we need to worry about confounders (perhaps more motivated students both study more and score higher), reverse causality, and identification strategies.
| | Prediction | Causation |
|---|---|---|
| Goal | Minimize forecast error | Estimate a causal effect |
| Method | Any model that predicts well | Requires an identification strategy |
| Evaluation | Out-of-sample accuracy (RMSE, AUC) | Unbiasedness, consistency |
| Coefficients | Not individually interpretable | Each has structural meaning |
| Confounders | Helpful (more predictive power) | Must be controlled for or eliminated |
In a prediction framework, we do not care about the interpretability of individual coefficients. A model with 500 features and complex interactions can be perfectly valid if it predicts well on new data. What matters is out-of-sample accuracy: how well does the model perform on data it has never seen? This is a fundamentally different criterion than the unbiasedness or consistency that we prize in causal inference.
Athey and Imbens (2019) emphasize that ML methods are explicitly optimized for prediction, not for causal estimation. Using them naively for causal questions — for example, interpreting a random forest’s variable importance as a causal effect — can be deeply misleading. The variable importance might reflect correlation, reverse causation, or confounding rather than a genuine causal relationship. Throughout this module, we will be careful to distinguish when we are using ML for prediction versus when we are using ML-augmented methods that maintain causal validity.
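To see the danger concretely, here is a small simulated illustration (our own construction, not from the paper): a feature with no causal effect on the outcome earns high random-forest importance purely because it is correlated with an unobserved confounder.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical simulation: X has NO causal effect on Y, but both are
# driven by an unobserved confounder C (think: motivation).
rng = np.random.default_rng(0)
n = 2000
C = rng.normal(size=n)                   # unobserved confounder
X = C + rng.normal(scale=0.5, size=n)    # correlated with C; no effect on Y
Z = rng.normal(size=n)                   # pure noise feature
Y = 2 * C + rng.normal(scale=0.5, size=n)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(np.column_stack([X, Z]), Y)

# X earns high importance because it predicts Y (through C),
# even though intervening on X would leave Y unchanged.
print("Importance of X (non-causal):", round(rf.feature_importances_[0], 3))
print("Importance of Z (noise):     ", round(rf.feature_importances_[1], 3))
```

The importance score is doing exactly its job, ranking predictive value; the mistake would be reading it as evidence that manipulating X moves Y.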
A Basic Prediction Pipeline
Let us set up the fundamental workflow that underlies every ML task: split the data into training and test sets, fit a model on the training data, and evaluate its performance on the test data. This train/test paradigm is the foundation of everything that follows.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate simulated data
np.random.seed(42)
n = 500
X = np.random.randn(n, 10) # 10 features
# True relationship: only first 3 features matter
y = 2 * X[:, 0] + 1.5 * X[:, 1] - 1 * X[:, 2] + np.random.randn(n) * 0.5
# Step 1: Split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: Fit model on training data only
model = LinearRegression()
model.fit(X_train, y_train)
# Step 3: Predict on test data
y_pred = model.predict(X_test)
# Step 4: Evaluate out-of-sample performance
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Training observations: {len(X_train)}")
print(f"Test observations: {len(X_test)}")
print(f"Test RMSE: {rmse:.4f}")
print(f"R-squared (test): {model.score(X_test, y_test):.4f}")
* Generate simulated data
clear
set seed 42
set obs 500
* Create 10 features
forvalues i = 1/10 {
gen x`i' = rnormal()
}
* True relationship: only first 3 features matter
gen y = 2*x1 + 1.5*x2 - 1*x3 + rnormal()*0.5
* Step 1: Create a train/test split indicator
gen split = runiform()
gen test = (split > 0.8)
* Step 2: Fit model on training data only
regress y x1-x10 if test == 0
* Step 3: Predict on all data (including test)
predict yhat
* Step 4: Evaluate on test set only
gen resid_sq = (y - yhat)^2 if test == 1
summarize resid_sq if test == 1
display "Test RMSE: " sqrt(r(mean))
count if test == 0
display "Training obs: " r(N)
count if test == 1
display "Test obs: " r(N)
library(rsample)
# Generate simulated data
set.seed(42)
n <- 500
X <- matrix(rnorm(n * 10), ncol = 10)
colnames(X) <- paste0("x", 1:10)
y <- 2 * X[, 1] + 1.5 * X[, 2] - 1 * X[, 3] + rnorm(n) * 0.5
df <- data.frame(y = y, X)
# Step 1: Split into training (80%) and test (20%)
split <- initial_split(df, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)
# Step 2: Fit model on training data
model <- lm(y ~ ., data = train_data)
# Step 3: Predict on test data
y_pred <- predict(model, newdata = test_data)
# Step 4: Evaluate out-of-sample performance
rmse_val <- sqrt(mean((test_data$y - y_pred)^2))
cat("Training observations:", nrow(train_data), "\n")
cat("Test observations: ", nrow(test_data), "\n")
cat("Test RMSE: ", round(rmse_val, 4), "\n")
r2 <- 1 - sum((test_data$y - y_pred)^2) / sum((test_data$y - mean(test_data$y))^2)
cat("R-squared (test): ", round(r2, 4), "\n")
Training observations: 400
Test observations: 100
Test RMSE: 0.5182
R-squared (test): 0.9571
Source | SS df MS
-------------+----------------------------------
Model | 2348.12741 10 234.812741
Residual | 96.5488209 389 .248197483
-------------+----------------------------------
Total | 2444.67623 399 6.12700810
Test RMSE: .50943
Training obs: 401
Test obs: 99

Training observations: 400
Test observations: 100
Test RMSE: 0.5237
R-squared (test): 0.9548
11.3 The Bias-Variance Tradeoff
The bias-variance tradeoff is arguably the single most important concept in machine learning. It explains why simple models sometimes outperform complex ones, why fitting your training data perfectly is usually a mistake, and how to think about the fundamental tension between underfitting and overfitting. Every ML method you will learn in this module — from LASSO to random forests to neural networks — can be understood as a particular way of navigating this tradeoff.
The key insight is that the expected prediction error for any model can be decomposed into three components:

Expected Prediction Error = Bias² + Variance + σ²
Bias measures how far off our model’s average prediction is from the truth. A model with high bias makes systematically wrong predictions because it is too simple to capture the true relationship. For instance, fitting a straight line to data that follows a curve produces high bias — the model simply cannot represent the underlying pattern, no matter how much data we give it. This is underfitting.
Variance measures how much our model’s predictions fluctuate across different training samples. A model with high variance is highly sensitive to the particular data points it was trained on. If you were to re-train the model on a slightly different random sample, the predictions would change dramatically. A 20th-degree polynomial that wiggles through every training point has high variance — it has essentially memorized the noise in that particular sample. This is overfitting.
The third term, σ², is the irreducible error — the inherent randomness in the data that no model can eliminate. Even with a perfect model and infinite data, you cannot predict an outcome perfectly if there is genuine noise in the system. This term sets a floor on how well any model can perform.
The fundamental tradeoff is this: as model complexity increases, bias decreases (the model can capture more complex patterns) but variance increases (the model becomes more sensitive to the training data). The optimal model is the one that minimizes the total error — the sweet spot where the sum of bias-squared and variance is at its minimum. Every ML method has one or more tuning parameters (hyperparameters) that control this tradeoff: the penalty parameter λ in LASSO, the maximum depth in decision trees, the number of hidden layers in a neural network.
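A quick Monte Carlo sketch makes the decomposition tangible. Under an assumed setup (true function sin(x) on [0, 2π], noise standard deviation 0.3, prediction evaluated at the single point x0 = 2), we re-fit a degree-1 and a degree-10 polynomial on many fresh training samples and measure the squared bias and variance of the prediction at x0:

```python
import numpy as np

# Monte Carlo sketch of the bias-variance decomposition (assumed setup:
# true function sin(x), noise sd 0.3, prediction evaluated at x0 = 2.0).
rng = np.random.default_rng(0)
x0, sigma = 2.0, 0.3
n_sims, n_train = 500, 30

preds = {1: [], 10: []}                      # degree -> predictions at x0
for _ in range(n_sims):
    x = rng.uniform(0, 2 * np.pi, n_train)
    y = np.sin(x) + rng.normal(scale=sigma, size=n_train)
    for d in preds:
        coefs = np.polyfit(x, y, deg=d)      # refit on a fresh sample
        preds[d].append(np.polyval(coefs, x0))

for d, p in preds.items():
    p = np.array(p)
    bias_sq = (p.mean() - np.sin(x0)) ** 2   # squared bias at x0
    variance = p.var()                       # spread across training samples
    print(f"degree {d:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The degree-1 fit shows large squared bias but small variance; the degree-10 fit shows the reverse, which is the tradeoff in miniature.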
The Bias-Variance Tradeoff
Demonstrating Overfitting with Polynomial Fits
Let us make this concrete. We will generate data from a noisy sine curve, then fit polynomials of increasing degree (1, 5, and 15). You will see that the degree-1 polynomial underfits (too simple), the degree-15 polynomial overfits (wiggles through every point), and the degree-5 polynomial hits a reasonable middle ground.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
np.random.seed(42)
# Generate noisy sine data
n_train, n_test = 30, 200
X_train = np.sort(np.random.uniform(0, 2 * np.pi, n_train)).reshape(-1, 1)
X_test = np.linspace(0, 2 * np.pi, n_test).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + np.random.randn(n_train) * 0.3
y_test = np.sin(X_test).ravel() + np.random.randn(n_test) * 0.3
# Fit polynomials of degree 1, 5, and 15
print(f"{'Degree':<10} {'Train MSE':<15} {'Test MSE':<15}")
print("-" * 40)
for degree in [1, 5, 15]:
poly = PolynomialFeatures(degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
model = LinearRegression()
model.fit(X_train_poly, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train_poly))
test_mse = mean_squared_error(y_test, model.predict(X_test_poly))
print(f"{degree:<10} {train_mse:<15.4f} {test_mse:<15.4f}")
* Generate noisy sine data
clear
set seed 42
set obs 230
* First 30 obs = training, rest = test
gen id = _n
gen test = (id > 30)
gen x = runiform() * 2 * _pi if test == 0
replace x = (id - 30) / 200 * 2 * _pi if test == 1
gen y = sin(x) + rnormal() * 0.3
* Generate polynomial terms
forvalues p = 2/15 {
gen x`p' = x^`p'
}
* Fit polynomial regressions of increasing degree
display "Degree Train MSE Test MSE"
display "----------------------------------------"
* Degree 1
quietly regress y x if test == 0
predict yhat1
gen se1_train = (y - yhat1)^2 if test == 0
gen se1_test = (y - yhat1)^2 if test == 1
quietly summarize se1_train
local tr1 = r(mean)
quietly summarize se1_test
display "1 " %9.4f `tr1' " " %9.4f r(mean)
* Degree 5
quietly regress y x x2-x5 if test == 0
predict yhat5
gen se5_train = (y - yhat5)^2 if test == 0
gen se5_test = (y - yhat5)^2 if test == 1
quietly summarize se5_train
local tr5 = r(mean)
quietly summarize se5_test
display "5 " %9.4f `tr5' " " %9.4f r(mean)
* Degree 15
quietly regress y x x2-x15 if test == 0
predict yhat15
gen se15_train = (y - yhat15)^2 if test == 0
gen se15_test = (y - yhat15)^2 if test == 1
quietly summarize se15_train
local tr15 = r(mean)
quietly summarize se15_test
display "15 " %9.4f `tr15' " " %9.4f r(mean)
set.seed(42)
# Generate noisy sine data
n_train <- 30; n_test <- 200
x_train <- sort(runif(n_train, 0, 2 * pi))
x_test <- seq(0, 2 * pi, length.out = n_test)
y_train <- sin(x_train) + rnorm(n_train, sd = 0.3)
y_test <- sin(x_test) + rnorm(n_test, sd = 0.3)
# Fit polynomials of degree 1, 5, and 15
degrees <- c(1, 5, 15)
results <- data.frame(Degree = integer(), Train_MSE = numeric(), Test_MSE = numeric())
for (d in degrees) {
# Fit polynomial regression
model <- lm(y_train ~ poly(x_train, d, raw = TRUE))
# Predict on train and test
pred_train <- predict(model)
pred_test <- predict(model, newdata = data.frame(x_train = x_test))
# Compute MSE
train_mse <- mean((y_train - pred_train)^2)
test_mse <- mean((y_test - pred_test)^2)
results <- rbind(results, data.frame(Degree = d, Train_MSE = train_mse, Test_MSE = test_mse))
}
print(results, row.names = FALSE)
Degree     Train MSE       Test MSE
----------------------------------------
1          0.2194          0.2487
5          0.0621          0.0845
15         0.0298          0.5731
Degree Train MSE Test MSE
----------------------------------------
1      0.2237    0.2512
5      0.0654    0.0891
15     0.0312    0.5843
Degree Train_MSE Test_MSE
     1   0.21506  0.24632
     5   0.06348  0.08714
    15   0.03012  0.56928

Notice the pattern: training MSE always decreases as the polynomial degree increases (the model fits the training data better and better). But test MSE first decreases (from degree 1 to degree 5) and then increases sharply (from degree 5 to degree 15). The degree-15 polynomial has essentially memorized the 30 training points, including their noise, and makes wild predictions on new data. This is overfitting in action, and it is precisely the bias-variance tradeoff at work.
11.4 Supervised vs. Unsupervised Learning
Machine learning methods are broadly divided into two families: supervised learning and unsupervised learning. The distinction is straightforward: in supervised learning, we have a labeled outcome variable Y that we want to predict from features X. In unsupervised learning, there is no outcome variable — we are simply looking for patterns and structure in the data. The name “supervised” comes from the idea that the labels “supervise” the learning process by telling the algorithm what the right answer is.
Within supervised learning, there are two main subtypes. Regression problems have a continuous outcome (predicting house prices, GDP growth, or exam scores). Classification problems have a categorical outcome (classifying emails as spam or not, predicting whether a loan will default, or identifying the sentiment of a text). The same algorithms (like decision trees or neural networks) can often handle both types, though the loss function and evaluation metrics differ. For regression, we typically use RMSE or MAE; for classification, accuracy, precision, recall, or AUC.
When one class dominates the data (e.g., 95% of loans do not default), a model that always predicts the majority class achieves 95% accuracy while being completely useless. For imbalanced classification, use precision (of those predicted positive, how many truly are), recall (of actual positives, how many were caught), or AUC (how well the model discriminates across all thresholds). See Module 11E: Model Evaluation for details.
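A minimal simulated illustration of this pitfall (the 5% default rate mirrors the running example; the data are made up):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical loan data: roughly 5% of borrowers default (y = 1)
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)

# A "model" that always predicts the majority class ("no default")
y_naive = np.zeros_like(y_true)

print("Accuracy:", accuracy_score(y_true, y_naive))  # high, near 0.95
print("Recall:  ", recall_score(y_true, y_naive))    # 0.0: catches no defaults
```

The naive classifier scores roughly 95% accuracy yet has zero recall: it identifies none of the defaults, which is exactly the failure mode precision, recall, and AUC are designed to expose.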
Unsupervised learning is used when we want to discover structure without a specific prediction target. Clustering methods (like k-means) group similar observations together — for example, segmenting countries by economic characteristics or grouping consumers by purchasing behavior. Dimensionality reduction methods (like PCA) compress many variables into a smaller number of components that capture the most important variation. For economists, most ML applications in research are supervised — predicting outcomes, classifying text, or using ML predictions as inputs to causal estimation. However, unsupervised methods appear in text analysis (topic models), factor models, and exploratory data analysis.
| | Supervised | Unsupervised |
|---|---|---|
| Goal | Predict Y from X | Find structure in X |
| Data | Labeled (X, Y) | Unlabeled (X only) |
| Examples | LASSO, Random Forest, Neural Nets | K-means, PCA, Topic Models |
| Evaluation | Compare predictions to known labels | Internal metrics (silhouette, variance explained) |
| Economics use | Prediction policy, causal ML | Text clustering, factor models |
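As a minimal sketch of the two unsupervised families in the table, here is k-means and PCA applied to simulated data (the "200 countries with 5 economic indicators" setting is an assumption for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Simulated data: 200 "countries", 5 economic indicators,
# two latent groups differing in their indicator levels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0, size=(100, 5)),
               rng.normal(loc=3, size=(100, 5))])

# K-means: group similar observations; no outcome variable involved
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", np.bincount(km.labels_))

# PCA: compress the 5 indicators into 2 components
pca = PCA(n_components=2).fit(X)
print("Share of variance explained:", pca.explained_variance_ratio_.round(3))
```

Note that neither method ever sees a label: k-means recovers the two latent groups, and the first principal component captures the direction along which the groups differ.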
11.5 Cross-Validation & Model Selection
We have established that out-of-sample performance is what matters for prediction. But a single train/test split has two problems. First, it wastes data — the observations in the test set are never used for training. Second, the results depend on the particular random split: you might get a favorable split where the test set happens to be easy, or an unfavorable one where it is unusually hard. K-fold cross-validation solves both problems by systematically rotating which data is used for training and testing.
The procedure works as follows. Divide your data into K equally sized “folds” (groups). For each fold k = 1, ..., K: train the model on all data except fold k, then evaluate it on fold k. After all K iterations, every observation has been used exactly once as a test point. The average of the K test errors gives you a robust estimate of out-of-sample performance. Typical choices are K = 5 or K = 10. The extreme case K = N (called leave-one-out cross-validation, or LOOCV) trains the model N times, each time leaving out a single observation. LOOCV has the lowest bias but can be computationally expensive and sometimes has higher variance than K = 10.
5-Fold Cross-Validation
Each observation is used as a test point exactly once. The final CV score is the average of all 5 test errors.
Cross-validation serves two critical purposes. First, it gives you a reliable estimate of how your model will perform on truly new data. Second, it allows you to choose hyperparameters — settings that control the model’s complexity. For instance, in LASSO regression (Module 11A), the penalty parameter λ controls how much regularization to apply. You can fit the LASSO for many different values of λ, compute the CV error for each, and choose the λ that minimizes the CV error. This data-driven approach to model selection is far more principled than eyeballing the results or relying on arbitrary rules of thumb.
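As a preview of Module 11A, scikit-learn's LassoCV automates exactly this loop: it fits the LASSO over a grid of penalty values (called alpha in scikit-learn rather than λ) and picks the one with the lowest 5-fold CV error. A sketch on the same simulated design used earlier in this module:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Same simulated design as before: 10 features, only the first 3 matter
np.random.seed(42)
n = 500
X = np.random.randn(n, 10)
y = 2 * X[:, 0] + 1.5 * X[:, 1] - 1 * X[:, 2] + np.random.randn(n) * 0.5

# LassoCV fits the model along a grid of penalties and selects,
# by 5-fold cross-validation, the penalty with the lowest CV error
lasso = LassoCV(cv=5, random_state=42).fit(X, y)
print(f"Chosen penalty (alpha): {lasso.alpha_:.4f}")
print("Nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```

With a strong signal on the first three features, the CV-chosen penalty keeps those coefficients close to their true values while shrinking the irrelevant ones toward zero.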
An important practical point: always do your cross-validation before looking at the test set. The test set should be touched only once, at the very end, to report your model’s final performance. If you repeatedly evaluate on the test set and adjust your model, you are effectively training on the test set, and your reported performance will be overly optimistic. Some practitioners advocate for a three-way split: training set (for fitting), validation set (for hyperparameter tuning), and test set (for final evaluation). Cross-validation on the training set replaces the need for a separate validation set.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
# Use the same simulated data
np.random.seed(42)
n = 500
X = np.random.randn(n, 10)
y = 2 * X[:, 0] + 1.5 * X[:, 1] - 1 * X[:, 2] + np.random.randn(n) * 0.5
# Set up 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LinearRegression()
# cross_val_score returns negative MSE (sklearn convention: higher = better)
scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
# Convert to positive RMSE
rmse_scores = np.sqrt(-scores)
print("RMSE for each fold:")
for i, score in enumerate(rmse_scores, 1):
print(f" Fold {i}: {score:.4f}")
print(f"\nMean RMSE: {rmse_scores.mean():.4f}")
print(f"Std RMSE: {rmse_scores.std():.4f}")
* Manual 5-fold cross-validation in Stata
clear
set seed 42
set obs 500
* Generate data
forvalues i = 1/10 {
gen x`i' = rnormal()
}
gen y = 2*x1 + 1.5*x2 - 1*x3 + rnormal()*0.5
* Assign observations to 5 folds randomly
gen u = runiform()
sort u
gen fold = mod(_n - 1, 5) + 1
* Perform 5-fold CV
gen cv_resid_sq = .
forvalues k = 1/5 {
quietly regress y x1-x10 if fold != `k'
predict yhat_temp if fold == `k'
replace cv_resid_sq = (y - yhat_temp)^2 if fold == `k'
drop yhat_temp
}
* Report RMSE per fold and overall
display "RMSE for each fold:"
forvalues k = 1/5 {
quietly summarize cv_resid_sq if fold == `k'
display " Fold `k': " %7.4f sqrt(r(mean))
}
quietly summarize cv_resid_sq
display _newline "Mean RMSE: " %7.4f sqrt(r(mean))
library(rsample)
# Generate data
set.seed(42)
n <- 500
X <- matrix(rnorm(n * 10), ncol = 10)
colnames(X) <- paste0("x", 1:10)
y <- 2 * X[, 1] + 1.5 * X[, 2] - 1 * X[, 3] + rnorm(n) * 0.5
df <- data.frame(y = y, X)
# Create 5-fold CV splits
folds <- vfold_cv(df, v = 5)
# Compute RMSE for each fold
rmse_vals <- sapply(1:5, function(i) {
train_data <- analysis(folds$splits[[i]])
test_data <- assessment(folds$splits[[i]])
model <- lm(y ~ ., data = train_data)
preds <- predict(model, newdata = test_data)
sqrt(mean((test_data$y - preds)^2))
})
cat("RMSE for each fold:\n")
for (i in 1:5) {
cat(sprintf(" Fold %d: %.4f\n", i, rmse_vals[i]))
}
cat(sprintf("\nMean RMSE: %.4f\n", mean(rmse_vals)))
cat(sprintf("Std RMSE: %.4f\n", sd(rmse_vals)))
RMSE for each fold:
  Fold 1: 0.4973
  Fold 2: 0.5241
  Fold 3: 0.5108
  Fold 4: 0.4856
  Fold 5: 0.5322

Mean RMSE: 0.5100
Std RMSE: 0.0173

RMSE for each fold:
  Fold 1: 0.5012
  Fold 2: 0.5198
  Fold 3: 0.5067
  Fold 4: 0.4921
  Fold 5: 0.5284

Mean RMSE: 0.5096

RMSE for each fold:
  Fold 1: 0.5031
  Fold 2: 0.5187
  Fold 3: 0.5095
  Fold 4: 0.4889
  Fold 5: 0.5299

Mean RMSE: 0.5100
Std RMSE: 0.0139
11.6 Module Roadmap
This module is organized into five subpages, each covering a major family of ML methods. Work through them in order, as later sections build on concepts from earlier ones.
- 11A: Regularization. Ridge, LASSO, Elastic Net, Post-LASSO for causal inference. (~2.5 hours)
- 11B: Tree-Based Methods. Decision Trees, Random Forests, Gradient Boosting (XGBoost). (~2.5 hours)
- 11C: Neural Networks. Perceptrons, backpropagation, deep learning fundamentals. (~2.5 hours)
- 11D: Causal ML. Double/Debiased ML, Causal Forests, heterogeneous treatment effects. (~2.5 hours)
- 11E: Model Evaluation. Metrics, confusion matrices, ROC curves, calibration, comparing models. (~2 hours)
Comparing the Major Method Families
Regularization (11A)
- Linear models with penalty terms
- Fast, interpretable, well-understood theory
- Best when relationship is approximately linear
- LASSO provides automatic variable selection
Trees (11B)
- Nonparametric, capture interactions naturally
- Random forests and boosting are very powerful
- No need to specify functional form
- Often best “out-of-the-box” performance
Neural Networks (11C)
- Universal function approximators
- Excel with unstructured data (text, images)
- Require large datasets and careful tuning
- Less common in applied economics (so far)
References
- Mullainathan, S. & Spiess, J. (2017). “Machine Learning: An Applied Econometric Approach.” Journal of Economic Perspectives, 31(2), 87–106.
- Athey, S. & Imbens, G. (2019). “Machine Learning Methods That Economists Should Know About.” Annual Review of Economics, 11, 685–725.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. 2nd ed. Springer.