11E Model Evaluation
Learning Objectives
- Understand why evaluating on training data is misleading and how overfitting distorts performance estimates
- Implement train-test splits and K-fold cross-validation in Python, Stata, and R
- Compute and interpret regression metrics: RMSE, MAE, R-squared, and MAPE
- Compute and interpret classification metrics: accuracy, precision, recall, F1, and AUC-ROC
- Recognize and prevent data leakage -- the most common source of invalid ML results
- Use learning curves to diagnose underfitting vs. overfitting
- Use proper evaluation strategies to compare and select among competing models
Why Evaluation Matters
Every model you have encountered in this course -- from OLS regression to LASSO, random forests, and neural networks -- needs to be evaluated. But evaluated on what? The answer to this question is one of the most important ideas in all of machine learning, and it separates rigorous empirical work from misleading results.
The central problem is overfitting. A sufficiently flexible model -- a deep neural network, a random forest with many trees, or even an OLS regression with hundreds of interaction terms -- can always achieve excellent performance on the data it was trained on. It does this by memorizing the noise, the outliers, and the quirks of that specific dataset. The model's apparent "accuracy" is an illusion: it reflects the model's ability to recall the training data, not its ability to predict new, unseen observations.
The Exam Analogy
Think of model evaluation like preparing for an exam. A student who memorizes every question and answer from past exams (training data) will score perfectly on those same exams -- but may fail completely when given new questions (test data) that require genuine understanding. The only honest measure of learning is performance on questions the student has never seen before. Machine learning models work the same way.
Consider a simple example. Suppose you want to predict GDP growth using 20 macroeconomic indicators, and your dataset contains 50 country-year observations. If you fit a polynomial regression with enough interaction terms, you can achieve an R-squared of 1.0 on the training data -- a "perfect" fit. But this model has essentially memorized your 50 data points. Give it data from a new year or a new country, and its predictions will be wildly inaccurate. The training-set R-squared of 1.0 told you nothing useful about the model's real predictive ability.
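The gap between training fit and real predictive ability is easy to reproduce. The sketch below uses synthetic data (not the GDP example itself): it fits a straight line and a degree-9 polynomial to 30 noisy points, then scores both on fresh draws from the same process. The flexible model wins in-sample and loses out-of-sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# 30 noisy training points from a simple linear process
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = 2 * X_train.ravel() + rng.normal(0, 2, size=30)

# Fresh data from the SAME process -- the "new year / new country"
X_new = rng.uniform(-3, 3, size=(200, 1))
y_new = 2 * X_new.ravel() + rng.normal(0, 2, size=200)

for degree in (1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    r2_in = r2_score(y_train, model.predict(X_train))
    r2_out = r2_score(y_new, model.predict(X_new))
    print(f"degree {degree}: train R2 = {r2_in:.3f}, new-data R2 = {r2_out:.3f}")
```

The degree-9 polynomial always posts the higher training R-squared; the only way to see that this is memorization rather than learning is to score it on data it never saw.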
The Cardinal Rule of Model Evaluation
Never evaluate a model on the same data used to train it. Training-set performance is always optimistically biased. The more flexible the model, the larger the gap between training performance and true out-of-sample performance. This is not a minor technicality -- it is the difference between a model that works and one that only appears to work.
The solution is conceptually simple: set aside data that the model has never seen during training, and evaluate the model's predictions on that held-out data. The specific strategies for doing this -- train-test splits and cross-validation -- are the subject of the next two sections. But the underlying principle is always the same: the only honest measure of a model's quality is its performance on data it has not been trained on.
This principle matters just as much in economics as in computer science. If you are building a model to predict firm bankruptcy, forecast inflation, or identify households eligible for a targeted policy, you need to know how well it will perform when deployed on new data. Reporting only training-set performance is analogous to judging a student by giving them the exact same exam they used to study -- it tells you about memorization, not understanding.
Train-Test Split
The simplest evaluation strategy is to split your data into two parts: a training set used to fit the model, and a test set (sometimes called a "hold-out set") used to evaluate it. The model never sees the test data during training -- it is kept completely separate, like sealing an exam in an envelope until the student has finished studying.
A typical split is 80% training and 20% test, though the exact ratio depends on the size of your dataset. With very large datasets (hundreds of thousands of observations), even a 90/10 split leaves plenty of test data. With small datasets (a few hundred observations), you may need cross-validation instead (covered in the next section), because holding out 20% leaves too few observations for both reliable training and reliable evaluation.
Two important implementation details. First, the split should be random -- if your data is sorted (by time, by country, by outcome), a non-random split would systematically exclude certain types of observations from training or testing, leading to biased results. Second, you should set a random seed so that your results are reproducible. Anyone running your code should get the same split and therefore the same results.
When Random Splits Are Wrong
If your data has a time dimension (panel data, time series), a random split can leak future information into the training set. For example, if you randomly split yearly GDP data, the model might train on 2020 data and predict 2018 -- it has seen the future. In these cases, use a temporal split: train on earlier periods and test on later ones. This mimics the real prediction task, where you only have past data available.
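A temporal split needs no special tooling -- it is just a filter on the time variable. A minimal sketch with a hypothetical yearly panel (placeholder values, pandas assumed):

```python
import pandas as pd

# Hypothetical yearly panel: 3 units per year, 2000-2019 (values are placeholders)
df = pd.DataFrame({
    "year": sorted(list(range(2000, 2020)) * 3),
    "gdp_growth": range(60),
})

# Temporal split: train strictly on the past, test on the future
train = df[df["year"] <= 2015]
test = df[df["year"] > 2015]

print(f"Train years: {train['year'].min()}-{train['year'].max()} (n={len(train)})")
print(f"Test years:  {test['year'].min()}-{test['year'].max()} (n={len(test)})")
# Train years: 2000-2015 (n=48)
# Test years:  2016-2019 (n=12)
```

For rolling-origin evaluation of time series, scikit-learn's `TimeSeriesSplit` generalizes this idea to multiple expanding train/test windows.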
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
# Load California Housing data
housing = fetch_california_housing(as_frame=True)
X = housing.data # 20,640 observations, 8 features
y = housing.target # Median house value (in $100k)
# Split: 80% training, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
print(f"Full dataset: {X.shape[0]:,} observations")
print(f"Training set: {X_train.shape[0]:,} observations ({X_train.shape[0]/X.shape[0]:.0%})")
print(f"Test set: {X_test.shape[0]:,} observations ({X_test.shape[0]/X.shape[0]:.0%})")
print(f"\nFeatures: {X.shape[1]}")
print(f"Target mean: {y.mean():.3f} (train: {y_train.mean():.3f}, test: {y_test.mean():.3f})")
* Stata: Train-test split using random number generation
* Load example data
sysuse auto, clear
* Set seed for reproducibility
set seed 42
* Generate a uniform random variable and split at 80th percentile
gen rand = runiform()
gen byte split = cond(rand <= 0.8, 1, 0)
label define splitlbl 1 "Train" 0 "Test"
label values split splitlbl
* Verify the split
tab split
* Summarize the outcome in each split
tabstat price, by(split) stat(n mean sd min max) format(%9.1f)
* Fit model on training data only
reg price mpg weight length if split == 1
* Generate predictions for ALL observations (train + test)
predict yhat
* Evaluate on TEST set only
gen resid_sq = (price - yhat)^2 if split == 0
quietly summarize resid_sq
display "Test RMSE: " sqrt(r(mean))
# R: Train-test split using rsample (tidymodels framework)
library(rsample)
# Load built-in mtcars data (32 observations, 10 features)
data(mtcars)
# Create an 80/20 split (stratified splits are also possible)
set.seed(42)
split_obj <- initial_split(mtcars, prop = 0.8)
# Extract training and test sets
train_data <- training(split_obj)
test_data <- testing(split_obj)
cat("Full dataset: ", nrow(mtcars), "observations\n")
cat("Training set: ", nrow(train_data), "observations",
sprintf("(%0.0f%%)\n", 100 * nrow(train_data) / nrow(mtcars)))
cat("Test set: ", nrow(test_data), "observations",
sprintf("(%0.0f%%)\n", 100 * nrow(test_data) / nrow(mtcars)))
cat("\nTarget (mpg) mean -- train:", round(mean(train_data$mpg), 2),
" test:", round(mean(test_data$mpg), 2), "\n")
Full dataset: 20,640 observations
Training set: 16,512 observations (80%)
Test set: 4,128 observations (20%)

Features: 8
Target mean: 2.069 (train: 2.072, test: 2.055)
split | Freq. Percent Cum.
-----------+-----------------------------------
Train | 59 79.73 79.73
Test | 15 20.27 100.00
-----------+-----------------------------------
Total | 74 100.00
split | N Mean S.D. Min Max
----------+-----------------------------------------------------
Train | 59 6165.3 2949.5 3291 15906
Test | 15 6229.9 3217.1 3748 14500
----------+-----------------------------------------------------
Total | 74 6178.4 2993.0 3291 15906
Test RMSE: 2461.38

Full dataset:  32 observations
Training set:  25 observations (78%)
Test set:  7 observations (22%)

Target (mpg) mean -- train: 19.96  test: 20.93
K-Fold Cross-Validation
A single train-test split has an obvious limitation: the evaluation depends on which observations happened to land in the test set. With a small dataset, a different random split could give substantially different results. You might get lucky (easy test observations) or unlucky (hard ones), and your performance estimate will be noisy.
The Practice Exam Analogy
Cross-validation is like studying with practice exams. Instead of taking one practice test and hoping it covers the right material, you take five different practice tests, each covering a different subset of the material. Your average score across all five gives a much more reliable estimate of how you will perform on the real exam than any single practice test alone.
K-fold cross-validation solves this by using every observation for both training and testing, just not at the same time. The procedure is straightforward: divide the data into K equally-sized subsets (called "folds"). Then, for each fold in turn, train the model on the other K-1 folds and evaluate it on the held-out fold. After K rounds, every observation has been in the test set exactly once. The final performance metric is the average across all K folds.
The most common choice is K = 5 or K = 10. With 5-fold CV, each fold contains 20% of the data, and the model is trained 5 times on 80% of the data. This gives a more reliable performance estimate than a single split, because it averages over 5 different train-test configurations. The cost is computational: you must train the model K times instead of once.
5-Fold Cross-Validation
Each observation appears in the test set exactly once. The final score is the average across all folds -- a more reliable estimate than any single split.
An important variant is stratified K-fold cross-validation, which ensures that each fold has roughly the same distribution of the target variable. This is particularly important for classification problems with imbalanced classes -- if only 5% of firms in your dataset go bankrupt, a non-stratified split might create a fold with no bankrupt firms at all, making evaluation meaningless.
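The failure mode is easy to see with scikit-learn's `KFold` and `StratifiedKFold`. The sketch below uses synthetic labels -- 10 "bankrupt" firms out of 200, deliberately listed first so that a plain contiguous split dumps them all into one fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 10 "bankrupt" firms out of 200, listed first
y = np.array([1] * 10 + [0] * 190)
X = np.zeros((200, 1))  # features are irrelevant to the split itself

for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    counts = [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]
    print(f"{name:16s} positives per test fold: {counts}")
# KFold            positives per test fold: [10, 0, 0, 0, 0]
# StratifiedKFold  positives per test fold: [2, 2, 2, 2, 2]
```

With plain (unshuffled) K-fold, four of the five folds contain no positive cases at all; stratification spreads the rare class evenly.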
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Define two competing models
ols = LinearRegression()
rf = RandomForestRegressor(n_estimators=100, random_state=42)
# 5-fold cross-validation for both models
# scoring="r2" computes R-squared; scoring="neg_mean_squared_error" gives -MSE
ols_scores = cross_val_score(ols, X, y, cv=5, scoring="r2")
rf_scores = cross_val_score(rf, X, y, cv=5, scoring="r2")
print("5-Fold Cross-Validation: R-squared")
print("=" * 45)
print(f"OLS Regression:")
print(f" Per-fold scores: {ols_scores.round(3)}")
print(f" Mean: {ols_scores.mean():.3f} (Std: {ols_scores.std():.3f})")
print(f"\nRandom Forest:")
print(f" Per-fold scores: {rf_scores.round(3)}")
print(f" Mean: {rf_scores.mean():.3f} (Std: {rf_scores.std():.3f})")
* Stata: K-fold cross-validation using the crossfold command
* Install if needed: ssc install crossfold
sysuse auto, clear
set seed 42
* crossfold performs k-fold CV for regression models
* It reports the out-of-sample RMSE for each fold
crossfold reg price mpg weight length, k(5)
* The output shows RMSE for each fold and the overall average
* You can also do this manually for more control:
* Manual 5-fold CV
gen fold = ceil(5 * runiform())
forvalues k = 1/5 {
quietly reg price mpg weight length if fold != `k'
quietly predict double yhat_`k' if fold == `k'
quietly gen double se_`k' = (price - yhat_`k')^2 if fold == `k'
quietly summarize se_`k'
display "Fold `k': RMSE = " sqrt(r(mean))
}
* Compute overall CV RMSE
egen all_se = rowtotal(se_1-se_5), missing
quietly summarize all_se
display "Overall CV RMSE: " sqrt(r(mean))
# R: K-fold cross-validation using tidymodels
library(rsample)
library(parsnip)
library(yardstick)
library(tune)
library(workflows)
# Load data
data(mtcars)
# Create 5-fold cross-validation object
set.seed(42)
folds <- vfold_cv(mtcars, v = 5)
# Define two models
lm_spec <- linear_reg() |> set_engine("lm")
rf_spec <- rand_forest(trees = 100) |>
set_engine("ranger") |>
set_mode("regression")
# Create workflows (model + formula)
lm_wf <- workflow() |> add_model(lm_spec) |> add_formula(mpg ~ .)
rf_wf <- workflow() |> add_model(rf_spec) |> add_formula(mpg ~ .)
# Fit both models using 5-fold CV
lm_results <- fit_resamples(lm_wf, resamples = folds, metrics = metric_set(rsq, rmse))
rf_results <- fit_resamples(rf_wf, resamples = folds, metrics = metric_set(rsq, rmse))
# Collect and display average metrics
cat("=== OLS Regression ===\n")
print(collect_metrics(lm_results))
cat("\n=== Random Forest ===\n")
print(collect_metrics(rf_results))
5-Fold Cross-Validation: R-squared
=============================================
OLS Regression:
  Per-fold scores: [0.596 0.609 0.593 0.607 0.592]
  Mean: 0.599 (Std: 0.007)

Random Forest:
  Per-fold scores: [0.805 0.826 0.812 0.817 0.809]
  Mean: 0.814 (Std: 0.007)
Fold 1: Root Mean Squared Error = 2372.019
Fold 2: Root Mean Squared Error = 2781.445
Fold 3: Root Mean Squared Error = 2196.584
Fold 4: Root Mean Squared Error = 2534.717
Fold 5: Root Mean Squared Error = 2688.103

Fold 1: RMSE = 2410.52
Fold 2: RMSE = 2693.18
Fold 3: RMSE = 2247.91
Fold 4: RMSE = 2581.36
Fold 5: RMSE = 2732.40
Overall CV RMSE: 2541.37
=== OLS Regression ===
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   3.46      5  0.485  Preprocessor1_Model1
2 rsq     standard   0.773     5  0.0631 Preprocessor1_Model1

=== Random Forest ===
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   2.87      5  0.398  Preprocessor1_Model1
2 rsq     standard   0.843     5  0.0412 Preprocessor1_Model1
Regression Metrics
Once you have a held-out test set (or cross-validation folds), you need a metric to quantify how well the model's predictions match the actual values. For regression problems -- where the outcome is a continuous variable like income, GDP, or house prices -- the most commonly used metrics are RMSE, MAE, R-squared, and MAPE. Each measures a different aspect of prediction quality, and the right choice depends on your application.
What are regression metrics for?
Regression metrics answer one question: how far off are my predictions from reality? If you predict that a house is worth $300,000 and it sells for $350,000, your error is $50,000. Regression metrics aggregate these individual errors across all test observations into a single number that summarizes the model's overall prediction quality. Different metrics emphasize different things -- some penalize big mistakes more, some are easier to interpret, and some allow comparison across different scales.
Root Mean Squared Error (RMSE)
RMSE is the most widely used regression metric. It computes the square root of the average squared prediction error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the actual value and $\hat{y}_i$ the model's prediction for observation $i$.
RMSE is in the same units as the outcome variable -- if you are predicting house prices in dollars, the RMSE is also in dollars. This makes it directly interpretable: an RMSE of $25,000 means the model's predictions are typically about $25,000 away from the true values. Because it squares the errors before averaging, RMSE penalizes large errors disproportionately. A single prediction that is off by $100,000 contributes more to the RMSE than ten predictions that are each off by $10,000. This is often desirable -- in many applications, big mistakes are much worse than small ones.
Mean Absolute Error (MAE)
MAE takes a simpler approach -- it averages the absolute prediction errors without squaring:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
MAE is also in the same units as the outcome and is easier to interpret: it is literally the average size of the prediction errors. Unlike RMSE, MAE treats all errors equally -- a $100,000 error contributes exactly 10 times as much as a $10,000 error. MAE is more robust to outliers than RMSE. If your data has a few extreme values that are hard to predict, RMSE will be dominated by those outliers, while MAE gives you a better sense of typical prediction quality.
R-squared (Coefficient of Determination)
R-squared measures the proportion of variance in the outcome that is explained by the model's predictions:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
R-squared is unitless and ranges from negative infinity to 1. An R-squared of 1.0 means perfect prediction. An R-squared of 0.0 means the model does no better than simply predicting the mean of y for every observation. An R-squared below zero means the model is worse than predicting the mean -- this can happen with out-of-sample evaluation when the model has overfit badly. The key advantage of R-squared is that it is scale-free -- you can compare R-squared values across different outcome variables (e.g., predicting income vs. predicting GDP growth) in a way that RMSE does not allow.
Mean Absolute Percentage Error (MAPE)
MAPE expresses errors as a percentage of the actual values:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
MAPE is intuitive -- a MAPE of 8% means predictions are off by about 8% on average. However, MAPE has a serious limitation: it is undefined when y_i = 0 and becomes extremely large when y_i is close to zero. For this reason, MAPE works best when the outcome variable is strictly positive and does not have values near zero (e.g., GDP, house prices, firm revenue).
When to Use RMSE
- Large errors are disproportionately costly
- You want a metric in the same units as the outcome
- Standard default for most ML applications
- Example: predicting house prices where a $200k miss is far worse than four $50k misses
When to Use MAE
- All error sizes are equally important
- Data contains outliers or extreme values
- You want a robust, easily interpretable metric
- Example: predicting delivery times where all delays matter equally
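The contrast between the two metrics is easiest to see numerically. The sketch below compares two hypothetical error patterns with the same total absolute error: ten $10k misses versus one $100k miss.

```python
import numpy as np

# Two hypothetical error patterns with the same total absolute error:
#   A: ten predictions each off by $10k
#   B: one prediction off by $100k, nine exactly right
errors_a = np.array([10_000.0] * 10)
errors_b = np.array([100_000.0] + [0.0] * 9)

for name, e in (("A (many small)", errors_a), ("B (one large) ", errors_b)):
    mae = np.mean(np.abs(e))
    rmse = np.sqrt(np.mean(e ** 2))
    print(f"{name}: MAE = ${mae:,.0f}, RMSE = ${rmse:,.0f}")
# A (many small): MAE = $10,000, RMSE = $10,000
# B (one large) : MAE = $10,000, RMSE = $31,623
```

MAE cannot tell the two patterns apart; RMSE penalizes the single large miss more than three-fold.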
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
mean_squared_error,
mean_absolute_error,
r2_score,
mean_absolute_percentage_error
)
# Load data and split
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Train a random forest
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# Generate predictions on the TEST set
y_pred = rf.predict(X_test)
# Compute all regression metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # sqrt of MSE; the squared= argument was removed in recent scikit-learn
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
print("Regression Metrics (Test Set)")
print("=" * 40)
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R-squared: {r2:.4f}")
print(f"MAPE: {mape:.2%}")
print(f"\nInterpretation:")
print(f" Predictions are off by ~${mae * 100000:,.0f} on average (MAE)")
print(f" The model explains {r2:.1%} of variance in house values")
* Stata: Computing regression metrics manually on the test set
sysuse auto, clear
set seed 42
* Create train/test split
gen rand = runiform()
gen byte is_test = (rand > 0.8)
* Fit model on training data
quietly reg price mpg weight length if is_test == 0
predict double yhat
* Compute metrics on TEST set only
gen double error = price - yhat if is_test == 1
gen double error_sq = error^2 if is_test == 1
gen double error_ab = abs(error) if is_test == 1
gen double pct_err = abs(error / price) if is_test == 1
* RMSE
quietly summarize error_sq
scalar rmse = sqrt(r(mean))
* MAE
quietly summarize error_ab
scalar mae = r(mean)
* R-squared (out-of-sample)
quietly summarize price if is_test == 1
scalar ybar = r(mean)
gen double ss_tot = (price - ybar)^2 if is_test == 1
quietly summarize ss_tot
scalar ss_total = r(sum)
quietly summarize error_sq
scalar ss_res = r(sum)
scalar r2_oos = 1 - ss_res / ss_total
* MAPE
quietly summarize pct_err
scalar mape = r(mean) * 100
* Display results
display "Regression Metrics (Test Set)"
display "========================================"
display "RMSE: " %9.2f rmse
display "MAE: " %9.2f mae
display "R-squared: " %9.4f r2_oos
display "MAPE: " %9.2f mape "%"
# R: Computing regression metrics with yardstick
library(rsample)
library(randomForest)
library(dplyr)      # needed for summarise() below
library(yardstick)
# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test <- testing(split)
# Train a random forest on the training set
rf_model <- randomForest(mpg ~ ., data = train, ntree = 200)
# Predict on the TEST set
test$predicted <- predict(rf_model, newdata = test)
# Compute metrics manually
results <- test |>
summarise(
RMSE = sqrt(mean((mpg - predicted)^2)),
MAE = mean(abs(mpg - predicted)),
R2 = 1 - sum((mpg - predicted)^2) / sum((mpg - mean(mpg))^2),
MAPE = mean(abs((mpg - predicted) / mpg)) * 100
)
# Display results
cat("Regression Metrics (Test Set)\n")
cat("========================================\n")
cat(sprintf("RMSE: %.4f\n", results$RMSE))
cat(sprintf("MAE: %.4f\n", results$MAE))
cat(sprintf("R-squared: %.4f\n", results$R2))
cat(sprintf("MAPE: %.2f%%\n", results$MAPE))
Regression Metrics (Test Set)
========================================
RMSE: 0.5010
MAE: 0.3272
R-squared: 0.8098
MAPE: 17.25%

Interpretation:
  Predictions are off by ~$32,720 on average (MAE)
  The model explains 81.0% of variance in house values
Regression Metrics (Test Set)
========================================
RMSE: 2461.38
MAE: 1894.52
R-squared: 0.3271
MAPE: 29.87%
Regression Metrics (Test Set)
========================================
RMSE: 2.7134
MAE: 2.1026
R-squared: 0.8523
MAPE: 10.84%
Classification Metrics
When the outcome is categorical -- whether a firm defaults, whether a household participates in a program, whether an email is spam -- the evaluation metrics are fundamentally different from regression. Prediction errors are no longer continuous deviations; they are discrete mistakes: the model said "yes" but the truth was "no," or vice versa. Classification metrics quantify these discrete errors in different ways, and choosing the right metric is critical because they often disagree about which model is "best."
What are classification metrics for?
Classification metrics answer: how often does the model get the category right, and what kind of mistakes does it make? Unlike regression where errors are "how far off," classification errors come in distinct flavors -- and the cost of each flavor depends entirely on the application. Flagging a legitimate email as spam (false positive) is annoying; missing a fraudulent transaction (false negative) can cost thousands of dollars. The metrics below let you measure and compare these different types of errors.
Economics Example: The Accuracy Trap
When predicting loan defaults (suppose only 1% of loans default), accuracy is useless. A model that predicts "no default" for every single loan achieves 99% accuracy while being completely useless for the task -- it never identifies any defaults. This is why you need precision, recall, and F1: they measure how well the model handles the rare but important class. In economics, the interesting class is almost always the rare one: firm bankruptcy, program take-up, policy violations, financial crises.
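The trap is easy to verify with scikit-learn. The sketch below uses synthetic labels matching the 1% default scenario and a "model" that never predicts a default:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 100 defaults out of 10,000 loans (the 1% scenario)
y_true = np.array([1] * 100 + [0] * 9900)

# A "model" that predicts "no default" for every single loan
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.1%}")  # 99.0%
print(f"Recall:   {recall_score(y_true, y_pred):.1%}")    # 0.0% -- it finds no defaults
```

Accuracy looks excellent; recall exposes that the model identifies none of the cases you actually care about.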
The foundation of all classification metrics is the confusion matrix. For a binary classification problem, it is a 2x2 table with four cells:
Confusion Matrix
Green cells = correct predictions. Red cells = errors. Every classification metric is computed from these four counts.
Accuracy
Accuracy is the simplest metric: the fraction of all predictions that are correct.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy is intuitive but deeply misleading with imbalanced classes. As shown in the loan default example above, a model that always predicts the majority class can achieve very high accuracy while being completely useless. For this reason, accuracy should almost never be used as the primary metric for imbalanced classification problems -- and in economics, most interesting classification tasks are imbalanced.
Precision and Recall
The Spam Filter Analogy
Precision vs. recall is like a spam filter: high precision means almost everything the filter flags IS actually spam, but you might miss some spam in your inbox. High recall means the filter catches almost ALL spam, but some real emails get flagged too. You cannot maximize both at once -- there is always a tradeoff, and the right balance depends on how costly each type of mistake is.
Precision answers: "Of all observations the model labeled as positive, how many actually were positive?" It measures the reliability of positive predictions.

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall (also called sensitivity or true positive rate) answers: "Of all observations that were actually positive, how many did the model identify?" It measures the completeness of positive detection.

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
There is an inherent trade-off between precision and recall. Lowering the classification threshold (being more willing to predict "positive") increases recall but decreases precision, and vice versa. The right balance depends on the costs of each type of error.
Precision-Recall Tradeoff
As you raise the threshold, precision increases (fewer false alarms) but recall decreases (more misses). The F1 score balances the two.
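The tradeoff can be seen directly by sweeping the threshold over one fixed set of predicted scores. The sketch below uses synthetic scores in which positives tend to score higher than negatives:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Synthetic scores: 100 positives centered higher than 900 negatives
y_true = np.array([1] * 100 + [0] * 900)
scores = np.concatenate([rng.normal(0.7, 0.15, 100),
                         rng.normal(0.3, 0.15, 900)])

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    results[threshold] = (p, r)
    print(f"threshold {threshold:.1f}: precision = {p:.2f}, recall = {r:.2f}")
```

Raising the threshold shrinks the set of predicted positives, so recall can only fall while precision typically rises. scikit-learn's `precision_recall_curve` traces the same sweep over every threshold at once.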
F1 Score
The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both:

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
F1 ranges from 0 to 1. It equals 1 only when both precision and recall are perfect. The harmonic mean penalizes imbalance: if either precision or recall is very low, F1 will be low even if the other is high. This makes F1 a better summary metric than accuracy for imbalanced problems.
Precision-Focused Problems
- Spam filtering (users hate false alarms)
- Drug approval (false positives have high costs)
- Targeting scarce program resources
- When acting on a prediction is expensive
Recall-Focused Problems
- Fraud detection (missing fraud is very costly)
- Disease screening (missing cases is dangerous)
- Identifying at-risk students or firms
- When missing a positive case is the worst outcome
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
classification_report,
confusion_matrix,
roc_auc_score,
accuracy_score
)
# Load breast cancer data (binary classification)
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42,
stratify=data.target
)
# Train a random forest classifier
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
# Hard predictions and predicted probabilities
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1] # Probability of positive class
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f" TN={cm[0,0]} FP={cm[0,1]}")
print(f" FN={cm[1,0]} TP={cm[1,1]}")
# Full classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
target_names=["Malignant", "Benign"]))
# AUC-ROC (uses probabilities, not hard labels)
auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC: {auc:.4f}")
* Stata: Classification metrics after logistic regression
webuse lbw, clear
* Binary outcome: low birth weight (low = 1 if birth weight < 2500g)
set seed 42
gen rand = runiform()
gen byte is_test = (rand > 0.8)
* Fit logistic regression on training data
logit low age lwt i.race smoke ptl ht ui if is_test == 0
* Predict probabilities for all observations
predict double phat, pr
* Classify using 0.5 threshold
gen byte y_pred = (phat >= 0.5)
* Confusion matrix on test set
display "Confusion Matrix (Test Set):"
tab low y_pred if is_test == 1
* estat classification gives sensitivity, specificity, etc.
quietly logit low age lwt i.race smoke ptl ht ui if is_test == 0
estat classification if is_test == 1
* ROC curve and AUC
lroc if is_test == 1, nograph
# R: Classification metrics using yardstick
library(rsample)
library(parsnip)
library(dplyr)      # needed for summarise() below
library(yardstick)
# Load data (built-in two-class dataset)
data(two_class_example, package = "yardstick")
# Compute confusion matrix
cat("Confusion Matrix:\n")
print(conf_mat(two_class_example, truth = truth, estimate = predicted))
# Compute all key classification metrics
metrics_result <- two_class_example |>
summarise(
Accuracy = accuracy_vec(truth, predicted),
Precision = precision_vec(truth, predicted),
Recall = recall_vec(truth, predicted),
F1 = f_meas_vec(truth, predicted)
)
cat("\nClassification Metrics:\n")
cat(sprintf(" Accuracy: %.4f\n", metrics_result$Accuracy))
cat(sprintf(" Precision: %.4f\n", metrics_result$Precision))
cat(sprintf(" Recall: %.4f\n", metrics_result$Recall))
cat(sprintf(" F1 Score: %.4f\n", metrics_result$F1))
# AUC-ROC (requires predicted probabilities)
auc_val <- roc_auc_vec(
two_class_example$truth,
two_class_example$Class1
)
cat(sprintf(" AUC-ROC: %.4f\n", auc_val))
Confusion Matrix:
TN=40 FP=3
FN=2 TP=69
Classification Report:
precision recall f1-score support
Malignant 0.95 0.93 0.94 43
Benign 0.96 0.97 0.97 71
accuracy 0.96 114
macro avg 0.96 0.95 0.95 114
weighted avg 0.96 0.96 0.96 114
AUC-ROC: 0.9952

Confusion Matrix (Test Set):
| y_pred
low | 0 1 | Total
-----------+----------------------+----------
0 | 23 4 | 27
1 | 5 6 | 11
-----------+----------------------+----------
Total | 28 10 | 38
Logistic model for low
-------- True --------
Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 6 4 | 10
- | 5 23 | 28
-----------+--------------------------+-----------
Total | 11 27 | 38
Sensitivity Pr( +| D) 54.55%
Specificity Pr( -|~D) 85.19%
Positive predictive value Pr( D| +) 60.00%
Negative predictive value Pr(~D| -) 82.14%
Correctly classified 76.32%
Area under ROC curve = 0.7828

Confusion Matrix:
Truth
Prediction Class1 Class2
Class1 227 27
Class2 23 223
Classification Metrics:
Accuracy: 0.9000
Precision: 0.8937
Recall: 0.9080
F1 Score: 0.9008
AUC-ROC: 0.9406
AUC-ROC and Threshold-Independent Evaluation
The Card-Sorting Analogy
AUC-ROC is like measuring how good you are at sorting a shuffled deck of red and black cards back into the right piles. If you are perfect, every red card gets placed before every black card (AUC = 1.0). If you sort randomly, you do no better than chance (AUC = 0.5). The AUC measures your ranking ability -- whether you can consistently place the positives ahead of the negatives -- regardless of where you draw the cutoff line between the two piles.
The AUC-ROC evaluates the model's ability to discriminate between positive and negative classes across all possible thresholds, rather than at a single threshold. The ROC curve plots the true positive rate (recall) against the false positive rate at every possible classification threshold. The area under this curve (AUC) summarizes overall discrimination ability.
AUC-ROC is particularly useful when you have not yet decided on a classification threshold, or when you want to compare models independently of threshold choice. A model with a higher AUC is better at ranking positive cases above negative cases, regardless of where you eventually draw the line.
ROC Curves: Good Model vs. Random vs. Perfect
The further the curve bows toward the top-left corner, the better the model. The shaded area is the AUC -- the probability that a randomly chosen positive case is ranked above a randomly chosen negative case.
Interpreting AUC Values
- AUC = 0.5: No better than flipping a coin. The model has no discrimination ability.
- AUC = 0.7-0.8: Acceptable discrimination. Common for many real-world problems.
- AUC = 0.8-0.9: Good discrimination. The model reliably ranks positives above negatives.
- AUC > 0.9: Excellent discrimination. Rare in noisy social science data -- check for data leakage if you see this unexpectedly.
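Threshold independence is easy to demonstrate directly. The sketch below is illustrative only (it uses a synthetic dataset, not any dataset from this module): a logistic regression's accuracy shifts as the classification cutoff moves, while its AUC, which depends only on the predicted probabilities, stays fixed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative stand-in)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # predicted P(y = 1)

# AUC is computed from the probability scores -- no threshold involved
print(f"AUC-ROC: {roc_auc_score(y_test, proba):.3f}")

# Accuracy depends on where we draw the line; AUC does not change
for thresh in [0.3, 0.5, 0.7]:
    acc = accuracy_score(y_test, proba >= thresh)
    print(f"Accuracy at threshold {thresh}: {acc:.3f}")
```

Note that `roc_auc_score` takes the probabilities, not the hard 0/1 predictions: passing thresholded predictions silently collapses the ROC curve to a single point.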
Hyperparameter Tuning
Most machine learning models have hyperparameters -- settings that must be chosen before training begins and that can dramatically affect performance. The number of trees in a random forest, the regularization penalty in LASSO, the depth of a decision tree, the learning rate of a neural network: none of these are learned from the data, and all of them matter.
The naive approach -- trying a few values by hand and picking whatever looks best on the test set -- is statistically dangerous. If you try 100 hyperparameter combinations and select the one with the best test-set performance, you have effectively used the test set for model selection, and your reported performance is optimistically biased. The solution is to use a validation set (separate from both training and test) or, better, to use cross-validation within the training set to evaluate hyperparameter choices, reserving the test set for final evaluation only.
Grid Search
Grid search evaluates every combination of a predefined set of hyperparameter values. If you specify 5 values for the regularization parameter and 4 values for the tree depth, grid search trains and evaluates 5 x 4 = 20 models. It is simple, exhaustive, and parallelizable, but becomes prohibitively expensive when the hyperparameter space is large.
Random Search
Random search samples hyperparameter combinations randomly from specified distributions. It is often more efficient than grid search because it explores the space more broadly. A classic result by Bergstra and Bengio (2012) showed that random search finds good hyperparameters in fewer evaluations than grid search, especially when some hyperparameters matter much more than others.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Load and split data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Define the hyperparameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [10, 20, None],
'min_samples_leaf': [1, 5, 10]
}
# Grid search with 5-fold CV (3 x 3 x 3 = 27 combinations)
grid_search = GridSearchCV(
RandomForestRegressor(random_state=42),
param_grid,
cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
# Results
print(f"\nBest hyperparameters:")
for param, val in grid_search.best_params_.items():
print(f" {param}: {val}")
print(f"\nBest CV RMSE: {(-grid_search.best_score_)**0.5:.4f}")
# Evaluate best model on held-out TEST set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_rmse = mean_squared_error(y_test, y_pred) ** 0.5  # sqrt of MSE = RMSE (works across sklearn versions)
print(f"Test RMSE (final, unbiased): {test_rmse:.4f}")
* Stata: Manual hyperparameter tuning via cross-validation loop
* Stata does not have a built-in GridSearchCV equivalent,
* so we loop over candidate specifications manually.
sysuse auto, clear
set seed 42
* Create 5 folds for cross-validation
gen fold = ceil(5 * runiform())
* Track best specification
local best_rmse = 99999
local best_spec = ""
* Loop over candidate model specifications
foreach spec in "mpg weight" "mpg weight length" "mpg weight length foreign" "mpg weight length foreign headroom trunk" {
local total_se = 0
local n_test = 0
forvalues k = 1/5 {
quietly reg price `spec' if fold != `k'
quietly predict double yhat_`k' if fold == `k'
quietly gen double se_`k' = (price - yhat_`k')^2 if fold == `k'
quietly summarize se_`k'
local total_se = `total_se' + r(sum)
local n_test = `n_test' + r(N)
drop yhat_`k' se_`k'
}
local cv_rmse = sqrt(`total_se' / `n_test')
display "Spec: `spec' => CV RMSE = " %9.2f `cv_rmse'
if `cv_rmse' < `best_rmse' {
local best_rmse = `cv_rmse'
local best_spec = "`spec'"
}
}
display _newline "Best specification: `best_spec'"
display "Best CV RMSE: " %9.2f `best_rmse'
# R: Hyperparameter tuning with tidymodels tune_grid()
library(tidymodels)
# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test <- testing(split)
folds <- vfold_cv(train, v = 5)
# Define a random forest model with tunable hyperparameters
rf_spec <- rand_forest(
trees = tune(),
min_n = tune()
) |>
set_engine("ranger") |>
set_mode("regression")
# Create a workflow
rf_wf <- workflow() |> add_model(rf_spec) |> add_formula(mpg ~ .)
# Define a hyperparameter grid
rf_grid <- grid_regular(
trees(range = c(50, 300)),
min_n(range = c(2, 10)),
levels = 4
)
# Tune using 5-fold CV
rf_tuned <- tune_grid(
rf_wf,
resamples = folds,
grid = rf_grid,
metrics = metric_set(rmse, rsq)
)
# Show best results
cat("Top 5 configurations by RMSE:\n")
print(show_best(rf_tuned, metric = "rmse", n = 5))
# Finalize workflow with best hyperparameters
best_params <- select_best(rf_tuned, metric = "rmse")
final_wf <- finalize_workflow(rf_wf, best_params)
final_fit <- last_fit(final_wf, split)
cat("\nFinal test set metrics:\n")
print(collect_metrics(final_fit))
Fitting 5 folds for each of 27 candidates, totalling 135 fits

Best hyperparameters:
  max_depth: 20
  min_samples_leaf: 1
  n_estimators: 200

Best CV RMSE: 0.5027
Test RMSE (final, unbiased): 0.4986
Spec: mpg weight => CV RMSE =   2648.93
Spec: mpg weight length => CV RMSE =   2541.37
Spec: mpg weight length foreign => CV RMSE =   2486.12
Spec: mpg weight length foreign headroom trunk => CV RMSE =   2592.81

Best specification: mpg weight length foreign
Best CV RMSE:   2486.12
Top 5 configurations by RMSE:
# A tibble: 5 x 8
  trees min_n .metric .estimator  mean     n std_err .config
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1   216     2 rmse    standard    2.68     5   0.412 Preprocessor1_Model06
2   300     2 rmse    standard    2.71     5   0.398 Preprocessor1_Model08
3   133     2 rmse    standard    2.73     5   0.425 Preprocessor1_Model04
4   216     5 rmse    standard    2.79     5   0.381 Preprocessor1_Model07
5   300     5 rmse    standard    2.82     5   0.393 Preprocessor1_Model09

Final test set metrics:
# A tibble: 2 x 4
  .metric .estimator .estimate .config
  <chr>   <chr>          <dbl> <chr>
1 rmse    standard       2.54  Preprocessor1_Model1
2 rsq     standard       0.867 Preprocessor1_Model1
Overfitting Diagnostics
How can you tell whether a model is underfitting, overfitting, or well-calibrated? The most informative diagnostic tools are learning curves and validation curves. These plots show you not just the final performance number, but how performance changes as you vary the training set size or a hyperparameter. They make the bias-variance trade-off visible.
What are learning curves for?
Learning curves answer a critical practical question: should I collect more data, or do I need a different model? If your model is underfitting (both training and validation error are high), more data will not help -- you need a more flexible model. If your model is overfitting (training error is low but validation error is high), more data will help, or you can add regularization. Learning curves make this diagnosis visual and concrete.
Learning Curves
A learning curve plots both training-set and validation-set performance as the size of the training set increases. The patterns are diagnostic:
- Underfitting: Both training and validation scores are low and converge quickly. Adding more data will not help -- the model is too simple. You need a more flexible model.
- Overfitting: Training score is high, but validation score is significantly lower. The gap between the two curves is large. Adding more data or regularization can help.
- Good fit: Training score is high, validation score is close to it, and both stabilize as the training set grows.
Learning Curve Patterns
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Compute learning curve
train_sizes, train_scores, val_scores = learning_curve(
RandomForestRegressor(n_estimators=100, random_state=42),
X, y,
train_sizes=np.linspace(0.1, 1.0, 10),
cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1
)
# Convert to RMSE (scores are negative MSE)
train_rmse = np.sqrt(-train_scores)
val_rmse = np.sqrt(-val_scores)
# Print numerical summary
print("Learning Curve Summary")
print("=" * 55)
print(f"{'Train Size':>12} {'Train RMSE':>12} {'Val RMSE':>12}")
print("-" * 55)
for i in range(len(train_sizes)):
print(f"{train_sizes[i]:>12,} {train_rmse.mean(axis=1)[i]:>12.4f} {val_rmse.mean(axis=1)[i]:>12.4f}")
# Optional: plot with matplotlib
# import matplotlib.pyplot as plt
# plt.plot(train_sizes, train_rmse.mean(axis=1), 'o-', label='Training RMSE')
# plt.plot(train_sizes, val_rmse.mean(axis=1), 'o-', label='Validation RMSE')
# plt.xlabel('Training Set Size'); plt.ylabel('RMSE')
# plt.legend(); plt.show()
* Stata: Manual learning curve computation
* Train on increasing fractions of data, evaluate on held-out test set
sysuse auto, clear
set seed 42
* Reserve 20% as permanent test set
gen rand = runiform()
gen byte is_test = (rand > 0.8)
* Count training observations
quietly count if is_test == 0
local n_train = r(N)
* Generate a random ordering for training observations
gen train_order = runiform() if is_test == 0
sort is_test train_order
* Loop over increasing training set fractions
display "Learning Curve: RMSE by Training Set Size"
display "==========================================="
display "{txt} Train Size Train RMSE Test RMSE"
display "-------------------------------------------"
foreach frac in 0.2 0.4 0.6 0.8 1.0 {
local n_use = floor(`frac' * `n_train')
* Mark which training obs to use
gen byte use_train = (_n <= `n_use') & (is_test == 0)
* Fit model on subset
quietly reg price mpg weight length if use_train == 1
quietly predict double yhat
* Train RMSE
quietly gen double se_train = (price - yhat)^2 if use_train == 1
quietly summarize se_train
local rmse_train = sqrt(r(mean))
* Test RMSE
quietly gen double se_test = (price - yhat)^2 if is_test == 1
quietly summarize se_test
local rmse_test = sqrt(r(mean))
display %12.0f `n_use' %12.2f `rmse_train' %12.2f `rmse_test'
drop use_train yhat se_train se_test
}
# R: Custom learning curve function
library(randomForest)
library(rsample)
# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test <- testing(split)
# Define training set fractions
fractions <- seq(0.3, 1.0, by = 0.1)
results <- data.frame(frac = fractions,
n = integer(length(fractions)),
train_rmse = numeric(length(fractions)),
test_rmse = numeric(length(fractions)))
for (i in seq_along(fractions)) {
n_use <- floor(fractions[i] * nrow(train))
idx <- sample(nrow(train), n_use)
train_sub <- train[idx, ]
# Train model on subset
rf <- randomForest(mpg ~ ., data = train_sub, ntree = 100)
# Train RMSE
train_pred <- predict(rf, train_sub)
results$train_rmse[i] <- sqrt(mean((train_sub$mpg - train_pred)^2))
# Test RMSE
test_pred <- predict(rf, test)
results$test_rmse[i] <- sqrt(mean((test$mpg - test_pred)^2))
results$n[i] <- n_use
}
# Print results
cat("Learning Curve Summary\n")
cat("==========================================\n")
cat(sprintf("%12s %12s %12s\n", "Train Size", "Train RMSE", "Test RMSE"))
cat("------------------------------------------\n")
for (i in 1:nrow(results)) {
cat(sprintf("%12d %12.4f %12.4f\n",
results$n[i], results$train_rmse[i], results$test_rmse[i]))
}
Learning Curve Summary
=======================================================
Train Size Train RMSE Val RMSE
-------------------------------------------------------
1,651 0.1834 0.5823
3,302 0.1921 0.5547
4,953 0.1956 0.5388
6,604 0.1963 0.5285
8,256 0.1976 0.5218
9,907 0.1987 0.5166
11,558 0.1992 0.5121
13,209 0.1998 0.5092
14,860 0.2001 0.5068
      16,512       0.2005       0.5047

Learning Curve: RMSE by Training Set Size
===========================================
Train Size Train RMSE Test RMSE
-------------------------------------------
12 1847.35 3142.68
23 2012.48 2784.51
35 2098.63 2621.34
47 2134.17 2512.89
          59      2156.42      2461.38

Learning Curve Summary
==========================================
Train Size Train RMSE Test RMSE
------------------------------------------
7 1.2341 4.8712
10 1.3028 3.9456
12 1.3894 3.5218
15 1.4216 3.2847
17 1.4587 3.1025
20 1.4812 2.9483
22 1.5034 2.8156
          25       1.5198       2.7134

Data Leakage
The Test-Answers Analogy
Data leakage is like accidentally seeing the test answers before the exam -- your score looks great, but you have not actually learned anything. When you deploy the model in the real world (where the "answers" are not available), performance collapses. Leakage is the #1 mistake that invalidates ML results, and it can be extremely subtle.
Data leakage occurs when information from the test set (or from the future, or from the outcome itself) sneaks into the training process. It is the single most common and most dangerous mistake in applied machine learning. Models with leakage produce spectacular evaluation metrics that evaporate when the model meets real data.
Common Sources of Leakage
1. Preprocessing on Full Data
If you normalize features (subtract mean, divide by standard deviation) using statistics computed on the entire dataset -- including test observations -- the model has indirectly "seen" test data. The fix: compute all preprocessing statistics (mean, std, min, max) on the training set only, then apply those same transformations to the test set. In scikit-learn, this means calling scaler.fit(X_train) and then scaler.transform(X_test) -- never scaler.fit(X) on the full dataset.
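A minimal sketch of the fix, on synthetic data: the scaler's statistics come from the training set only, and wrapping the scaler and the model in a Pipeline goes one step further by refitting the scaler inside every cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Correct: fit the scaler on the training data only...
scaler = StandardScaler().fit(X_train)   # statistics from train set only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # ...then apply the SAME transform to test

# Even safer: a Pipeline refits the scaler within each CV fold, so
# preprocessing never sees the observations used for validation
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.3f}")
```

The Pipeline approach is the one to prefer whenever preprocessing is combined with cross-validation: manually scaling before calling `cross_val_score` reintroduces exactly the leakage this section warns about, just at the fold level.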
2. Future Information in Features
If you use variables that would not be available at prediction time, you have leakage. For example, using next quarter's GDP to predict this quarter's unemployment, or using a "days until bankruptcy" variable to predict whether a firm will go bankrupt. Any feature that is a consequence of the outcome rather than a cause or predictor is a leakage risk. Always ask: "Would I have this variable available at the time I need to make the prediction?"
3. Duplicates Across Train/Test
If the same observation appears in both the training and test set (due to duplicated rows, overlapping time windows, or group-level correlation), the model can memorize it during training and "predict" it perfectly during testing. This is especially common with panel data where the same individual appears in multiple rows -- if you split randomly, you may train on person X in 2019 and test on person X in 2020, which is not a true out-of-sample test.
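For panel data, the standard remedy is to split by group rather than by row. The sketch below uses a toy panel (hypothetical person IDs, two years each) with scikit-learn's GroupKFold, which guarantees that all rows for a given person land in the same fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy panel: 6 individuals, each observed in two years (12 rows total)
person = np.repeat(np.arange(6), 2)   # person id = the grouping variable
X = np.arange(12).reshape(-1, 1)
y = np.arange(12)

# GroupKFold keeps every row for a person inside one fold, so we never
# train on person i in 2019 and "test" on person i in 2020
gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=person)):
    train_people = set(person[train_idx])
    test_people = set(person[test_idx])
    assert train_people.isdisjoint(test_people)  # no person straddles the split
    print(f"Fold {fold}: test people = {sorted(test_people)}")
```

The same idea applies to any clustered data (firms, schools, regions): pass the cluster identifier as `groups`, and the out-of-sample claim becomes a genuine out-of-cluster claim.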
How to Detect Leakage
The clearest warning sign is suspiciously good performance. If your model achieves an AUC of 0.99 on a messy social science prediction task, something is probably wrong. Other red flags:
- Test performance is much higher than you expected based on the literature or the difficulty of the problem
- Performance drops dramatically when you switch to a truly temporal split or a new dataset
- The most important features (from feature importance) are variables that seem to "encode" the outcome rather than predict it
- Adding more data does not change performance at all -- the model has already memorized the relationship
Model Selection and Comparison
In practice, model evaluation is not about computing a single metric for a single model. It is a systematic process of comparing multiple candidate models to select the best one for your task. A reliable model comparison workflow follows these steps:
- Split your data into training and test sets (or use nested cross-validation for small datasets). The test set is locked away until the very end.
- Define candidate models: OLS, LASSO, random forest, gradient boosting, neural network, etc. Include simple baselines (e.g., "always predict the mean") to ensure your model adds value.
- Tune hyperparameters for each candidate using cross-validation on the training set only. Each model gets its best hyperparameters.
- Compare cross-validated performance across models. Use the same folds for all models to ensure a fair comparison.
- Select the best model and evaluate it once on the test set. This is your final, unbiased performance estimate. Do not go back and try more models after seeing the test-set result.
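The fold-sharing and baseline steps above can be sketched in a few lines. The dataset and the candidate list below are illustrative stand-ins; the key detail is that a single KFold object (fixed seed) is passed to every model, so all candidates are scored on identical folds.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=42)

# One shared KFold object: every candidate sees the exact same folds
folds = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    'mean baseline': DummyRegressor(strategy='mean'),  # sanity-check baseline
    'OLS': LinearRegression(),
    'LASSO': Lasso(alpha=1.0),
    'random forest': RandomForestRegressor(n_estimators=100, random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=folds,
                             scoring='neg_root_mean_squared_error')
    print(f"{name:>14}: CV RMSE = {-scores.mean():.2f}")
```

If a candidate cannot beat the mean-prediction baseline by a clear margin, that is worth knowing before any hyperparameter tuning; and whichever model wins here still gets exactly one final evaluation on the locked-away test set.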
Prediction Task Evaluation
- Goal: maximize out-of-sample prediction accuracy
- Metrics: RMSE, MAE, AUC-ROC, F1
- Method: cross-validation + held-out test set
- Flexibility is a virtue if it improves predictions
- Examples: forecasting, targeting, classification
Causal Task Evaluation
- Goal: estimate a causal effect (ATE, CATE, ATT)
- Metrics: bias, coverage of confidence intervals, MSE of the treatment effect estimate
- Method: simulation studies, placebo tests, sensitivity analysis
- Interpretability and identification matter more than raw fit
- Examples: policy evaluation, treatment effects, DiD
The distinction above is fundamental. The entire model evaluation framework in this module -- cross-validation, RMSE, AUC-ROC, hyperparameter tuning -- is designed for prediction tasks. When the goal is causal inference, the evaluation criteria are different: you care about whether the estimated treatment effect is unbiased and whether the confidence intervals have correct coverage, not about whether the model predicts the outcome well. A model can be an excellent predictor but a terrible estimator of causal effects, and vice versa. See Module 11D (Causal ML) for evaluation strategies specific to causal tasks.
References
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. (Available free at hastie.su.domains/ElemStatLearn)
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer. (Available free at statlearning.com)
- Athey, S., & Imbens, G. W. (2019). Machine Learning Methods That Economists Should Know About. Annual Review of Economics, 11, 685-725.
- Bergstra, J., & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281-305.
- Mullainathan, S., & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2), 87-106.