11B  Tree-Based Methods

~3 hours · Decision Trees, Random Forests, XGBoost, SHAP

Learning Objectives

  • Understand how decision trees partition the feature space
  • Build and interpret Random Forests for prediction
  • Implement gradient boosting with XGBoost
  • Use SHAP values to interpret model predictions
  • Know when to choose trees over regularized regression

The Big Picture: Three Levels of Tree-Based Methods

This module covers three increasingly powerful approaches. Think of them as levels:

  1. Decision Trees — A single tree that asks yes/no questions to split data. Highly interpretable, like reading a flowchart, but prone to overfitting. Good for understanding the logic of your data.
  2. Random Forests — Hundreds of trees, each trained on a slightly different random subset, whose predictions are averaged. The best general-purpose method for tabular data. Your default starting point for prediction tasks.
  3. Gradient Boosting / XGBoost — Trees built sequentially, each one correcting the mistakes of the previous ones. The highest accuracy on structured data; the dominant method in Kaggle competitions and production ML systems. Use when you need the best possible predictions and are willing to tune hyperparameters.

We also cover SHAP values (making black-box models interpretable) and variable importance (which features drive predictions) — essential tools for communicating results to policymakers and stakeholders.

How Trees Partition Data

Analogy: A Game of 20 Questions

A decision tree is like a game of 20 Questions. Imagine you are trying to guess the price of a house. You ask: "Is it in the city center?" Then: "Does it have more than two bedrooms?" Then: "Is the lot bigger than 500 square meters?" Each answer narrows the range of possible prices. A decision tree works exactly this way — it learns the most informative yes/no question to ask at each step, automatically, from the data.

Geometrically, each split divides the feature space with a line (or hyperplane) that is always perpendicular to one of the feature axes. After several splits, the feature space is partitioned into a set of rectangles (in 2D) or hyper-rectangles (in higher dimensions). Every observation that falls into the same rectangle receives the same prediction — the average of the outcome variable for the training observations in that rectangle. This is what makes trees fundamentally different from linear regression: instead of fitting a single line through the data, trees approximate the relationship with a step function that can capture sharp changes and interactions between variables without you having to specify them in advance.

The key insight is that each split is chosen to maximize the improvement in prediction. At every node, the algorithm considers every possible feature and every possible split point for that feature, and picks the one that best separates the data. "Best" means different things depending on the task: for regression, we typically minimize the mean squared error (MSE) within each resulting group; for classification, we maximize the purity of each group using criteria like the Gini index or entropy. The process is recursive — each child node is split again using the same logic — and continues until some stopping rule is met (for example, a minimum number of observations per leaf, or a maximum tree depth).
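
To make the split search concrete, here is a minimal sketch (synthetic one-feature data; the weighted_mse helper is illustrative, not a library function) that evaluates every candidate split point and keeps the one minimizing the combined MSE of the two resulting groups. A real implementation repeats this search over every feature at every node.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 6.2, 20.0, 35.0) + rng.normal(0, 2, 200)   # true break at x = 6.2

def weighted_mse(left, right):
    # combined within-group squared error, averaged over all observations
    sse_left = ((left - left.mean()) ** 2).sum()
    sse_right = ((right - right.mean()) ** 2).sum()
    return (sse_left + sse_right) / (len(left) + len(right))

candidates = np.unique(x)[1:-1]   # every observed value is a candidate split point
scores = [weighted_mse(y[x < c], y[x >= c]) for c in candidates]
best_split = candidates[int(np.argmin(scores))]
print(f"Best split: x < {best_split:.2f}")   # recovers roughly 6.2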

How a Tree Partitions a 2D Feature Space

[Figure: left panel "Decision Tree" — root split x1 < 5 (No: y = 42), then x2 < 3 (Yes: y = 18, No: y = 31); right panel "Resulting Partitions" — the (x1, x2) plane divided by the axis-aligned boundaries x1 = 5 and x2 = 3.]

Each split creates axis-aligned boundaries. Observations in the same partition receive the same prediction.

Notice how the tree produces a piecewise constant approximation. This is both the strength and weakness of tree-based methods: they can capture complex interactions and nonlinearities, but a single tree tends to be "choppy" — it cannot produce smooth predictions. As we will see, Random Forests and gradient boosting address this limitation by combining many trees together.

Decision Trees (CART)

What Are Decision Trees For?

Decision trees are the building block of all tree-based methods. On their own, they tend to overfit and have mediocre predictive accuracy. But they have one unmatched advantage: transparency. You can visualize the entire model as a flowchart that any stakeholder — a minister, a bank regulator, a non-technical collaborator — can follow. Use a single decision tree when you need to explain the logic of your predictions, not just the predictions themselves.

The most widely used algorithm for building decision trees is CART (Classification and Regression Trees), introduced by Leo Breiman and colleagues in 1984. CART builds a binary tree by performing recursive binary splitting: starting from the root, it considers every feature and every possible split point, selects the one that most reduces the loss function, and then repeats the process on each of the two resulting child nodes. For a regression tree, the loss function at each node is typically the mean squared error (MSE): the algorithm picks the split that produces two child nodes whose combined MSE is as small as possible.

For classification trees, CART uses either the Gini impurity or entropy as the splitting criterion. Both measure the "purity" of a node — how mixed the class labels are. The lower the impurity, the more homogeneous the node. In practice, the choice between Gini and entropy rarely makes a significant difference; Gini is the default in most implementations because it is slightly faster to compute.
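
As a quick worked example (a sketch using the standard formulas, Gini = 1 - sum(p_k^2) and entropy = -sum(p_k * log2(p_k)), rather than any library routine), consider a node that is 90% one class and 10% the other:

import numpy as np

def gini(p):
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

# a node with 90% non-defaulters and 10% defaulters
p = np.array([0.9, 0.1])
print(f"Gini:    {gini(p):.3f}")      # 0.180
print(f"Entropy: {entropy(p):.3f}")   # 0.469
# a perfectly pure node (p = [1.0, 0.0]) scores 0 on both criteria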

Left unconstrained, a tree will keep splitting until every leaf contains a single observation, perfectly memorizing the training data. This is massive overfitting. To prevent this, we use pruning. The most common approach is cost-complexity pruning (also called weakest-link pruning): we grow a large tree first, then progressively collapse the internal nodes that contribute least to reducing the loss. The trade-off is governed by a complexity parameter (often called cp in R or ccp_alpha in scikit-learn) that you select via cross-validation.
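
A minimal sketch of how that selection might look in scikit-learn, using cost_complexity_pruning_path to generate candidate alphas and 5-fold cross-validation to choose among them (the min_samples_leaf=20 setting and the [::25] thinning are just to keep the example fast):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.2, random_state=42)

# candidate alphas come from the pruning path of a large tree
base = DecisionTreeRegressor(min_samples_leaf=20, random_state=42)
alphas = base.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas[::25]

cv_means = [
    cross_val_score(
        DecisionTreeRegressor(min_samples_leaf=20, ccp_alpha=a, random_state=42),
        X_tr, y_tr, cv=5, scoring="r2"
    ).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_means))]
print(f"Best ccp_alpha by cross-validation: {best_alpha:.5f}")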

Economics example: Imagine predicting loan defaults using borrower characteristics (income, debt-to-income ratio, credit history length, employment status). A single decision tree might find that the most informative first question is "Is the debt-to-income ratio above 40%?" If yes, the next question might be "Has the borrower missed a payment in the last 12 months?" This produces an interpretable risk-scoring flowchart that a loan officer can follow — and that a regulator can audit for fairness.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score
import matplotlib.pyplot as plt

# Load the California housing dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a decision tree with cost-complexity pruning
tree = DecisionTreeRegressor(
    max_depth=5,
    min_samples_leaf=20,
    ccp_alpha=0.01,
    random_state=42
)
tree.fit(X_train, y_train)

# Evaluate performance
train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print(f"Training R-squared: {train_r2:.4f}")
print(f"Test R-squared:     {test_r2:.4f}")
print(f"Number of leaves:   {tree.get_n_leaves()}")
print(f"Tree depth:         {tree.get_depth()}")

# Cross-validated performance
cv_scores = cross_val_score(tree, X_train, y_train, cv=5, scoring='r2')
print(f"\n5-Fold CV R-squared: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Visualize the tree (first 3 levels)
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(tree, max_depth=3, feature_names=housing.feature_names,
          filled=True, rounded=True, fontsize=8, ax=ax)
plt.title("Decision Tree for California Housing Prices")
plt.tight_layout()
plt.show()
* Stata does not have a built-in CART implementation.
* We use the community-contributed 'rforest' command
* (which implements random forests but can approximate a single tree).
* Install: ssc install rforest

* Load example data
sysuse auto, clear

* Create train/test split (80/20)
set seed 42
gen rand = runiform()
gen byte train = (rand <= 0.8)

* Fit a single regression tree using rforest with 1 tree
* iterations(1) = one tree; lsize(20) = min leaf size
* numvars(4) = consider all 4 predictors at every split, so the
*   single tree behaves more like a deterministic CART fit
rforest price mpg weight length foreign if train == 1, ///
    type(reg) ///
    iterations(1) ///
    numvars(4) ///
    depth(5) ///
    lsize(20)

* Predict on test data
predict price_hat

* Evaluate on the test set: the squared correlation between
* actual and predicted price approximates the out-of-sample R-squared
corr price price_hat if train == 0
gen resid = price - price_hat if train == 0
summarize resid, detail

* Note: For full CART with pruning, consider calling Python
* from within Stata using the 'python:' block.
# R: Decision tree with rpart and visualization with rpart.plot
library(rpart)
library(rpart.plot)

# Load data
data("BostonHousing", package = "mlbench")
df <- BostonHousing

# Train/test split
set.seed(42)
train_idx <- sample(nrow(df), floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Fit a regression tree
tree_model <- rpart(
  medv ~ .,
  data = train,
  method = "anova",
  control = rpart.control(
    maxdepth = 5,
    minbucket = 20,
    cp = 0.01
  )
)

# Predictions and R-squared on test set
pred_test <- predict(tree_model, newdata = test)
ss_res <- sum((test$medv - pred_test)^2)
ss_tot <- sum((test$medv - mean(test$medv))^2)
r2_test <- 1 - ss_res / ss_tot

cat("Test R-squared:", round(r2_test, 4), "\n")
cat("Number of leaves:", sum(tree_model$frame$var == "<leaf>"), "\n")

# Visualize the tree
rpart.plot(tree_model,
  type = 4,
  extra = 101,
  under = TRUE,
  box.palette = "BuGn",
  main = "Regression Tree for Boston Housing"
)
Python Output
Training R-squared: 0.6521
Test R-squared:     0.5987
Number of leaves:   14
Tree depth:         5

5-Fold CV R-squared: 0.5823 (+/- 0.0312)

Generated figures:

[Figure "Decision Tree for California Housing Prices": root split on MedInc <= 5.03 (n = 16512), followed by splits on AveOccup, MedInc, Latitude, HouseAge, Longitude, and AveRooms; leaves are shaded from low to high predicted value.]
Stata Output
Random forest  regress estimates       Number of obs     =         59
                                        Number of trees   =          1
                                        Min leaf size     =         20
                                        Max depth         =          5

(predictions stored in price_hat)

             |    price price_hat
-------------+------------------
       price |   1.0000
   price_hat |   0.7234   1.0000

    Variable |       Obs        Mean    Std. dev.       Min        Max
-------------+--------------------------------------------------------
       resid |        15    -123.45    2156.321   -4521.33   3892.167
R Output
Test R-squared: 0.7542
Number of leaves: 11

[rpart.plot displayed: tree diagram with green-shaded nodes.
 Root splits on rm < 6.9, then lstat, crim, and dis.
 Each leaf shows predicted median home value and n.]

Random Forests

Analogy: Asking 500 Experts and Taking a Vote

A Random Forest is like consulting 500 different experts, where each expert has studied a slightly different random subset of the available evidence, and each one focuses on different aspects of the problem. Individually, each expert might make mistakes. But when you average all 500 opinions, the errors tend to cancel out and the collective answer is remarkably accurate. This is the "wisdom of crowds" applied to prediction.

What Are Random Forests For?

Random Forests are the best general-purpose method for tabular data. They work well out of the box with minimal tuning, handle nonlinear relationships and interactions automatically, and are robust to outliers. If you are starting a prediction task on structured data (spreadsheets, survey data, administrative records) and do not know which method to try first, start with a Random Forest. It is the reliable workhorse of applied machine learning.

A single decision tree is easy to interpret but tends to have high variance: if you slightly change the training data, you might get a completely different tree. Random Forests, proposed by Leo Breiman in 2001, solve this problem by growing many trees and averaging their predictions.

Random Forests combine two sources of randomness. The first is bagging (bootstrap aggregation): each tree is trained on a different bootstrap sample of the data — a random sample drawn with replacement. The second source of randomness is random feature subsets: at each split, instead of considering all features, the algorithm randomly selects a subset and picks the best split among those. This decorrelates the trees — if one feature is very strong, not every tree will split on it first, giving other features a chance to contribute.

An elegant feature of Random Forests is the out-of-bag (OOB) error. Because each tree is trained on only about 63% of the data (the bootstrap sample), the remaining 37% can be used as a built-in validation set. The OOB error gives you a reliable estimate of generalization error without needing a separate validation set — it is essentially free cross-validation.
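
The 63% figure comes from the bootstrap itself: the probability that a given row is never drawn in n draws with replacement is (1 - 1/n)^n, which approaches 1/e ≈ 0.368, so roughly 63.2% of rows appear in each tree's sample. A quick simulation (arbitrary n, purely illustrative) confirms it:

import numpy as np

rng = np.random.default_rng(42)
n = 10_000
unique_fractions = [
    len(np.unique(rng.integers(0, n, size=n))) / n   # one bootstrap sample
    for _ in range(100)
]
print(f"Mean fraction of unique rows: {np.mean(unique_fractions):.3f}")
print(f"Theoretical limit 1 - 1/e:    {1 - np.exp(-1):.3f}")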

Single Tree vs. Random Forest: Why Averaging Helps

[Figure: two panels of prediction against a single feature. Left, "Single Tree (choppy, high variance)": a jagged step-function fit. Right, "Random Forest (smooth, low variance)": many individual trees overlaid with their smooth forest average.]

Averaging many choppy trees produces a smooth prediction that closely follows the true pattern in the data.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Fit a Random Forest
rf = RandomForestRegressor(
    n_estimators=500,
    max_depth=None,
    max_features='sqrt',
    min_samples_leaf=5,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# Evaluate
y_pred = rf.predict(X_test)
print(f"OOB R-squared:  {rf.oob_score_:.4f}")
print(f"Test R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"Test RMSE:      {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"Number of trees: {rf.n_estimators}")
* Stata: Random Forest using the rforest command
* Install: ssc install rforest

sysuse auto, clear
set seed 42

* Create train/test split
gen rand = runiform()
gen byte train = (rand <= 0.8)

* Fit random forest (500 trees, max depth 10)
rforest price mpg weight length foreign if train == 1, ///
    type(reg) ///
    iterations(500) ///
    numvars(2) ///
    depth(10) ///
    lsize(5)

* Predict and evaluate
predict price_hat
corr price price_hat if train == 0

* Compute RMSE on test set
gen sq_err = (price - price_hat)^2 if train == 0
summarize sq_err, meanonly
display "Test RMSE: " sqrt(r(mean))
# R: Random Forest with the ranger package (fast implementation)
library(ranger)

# Fit random forest
rf_model <- ranger(
  medv ~ .,
  data = train,
  num.trees = 500,
  mtry = 4,
  min.node.size = 5,
  importance = "permutation"
)

# Predictions on test data
pred_rf <- predict(rf_model, data = test)$predictions

# R-squared and RMSE
ss_res <- sum((test$medv - pred_rf)^2)
ss_tot <- sum((test$medv - mean(test$medv))^2)
r2 <- 1 - ss_res / ss_tot
rmse <- sqrt(mean((test$medv - pred_rf)^2))

cat("OOB Prediction Error (MSE):", round(rf_model$prediction.error, 4), "\n")
cat("Test R-squared:", round(r2, 4), "\n")
cat("Test RMSE:", round(rmse, 4), "\n")
Python Output
OOB R-squared:  0.8053
Test R-squared: 0.8102
Test RMSE:      0.5249
Number of trees: 500
Stata Output
Random forest  regress estimates       Number of obs     =         59
                                        Number of trees   =        500
                                        Min leaf size     =          5
                                        Max depth         =         10

(predictions stored in price_hat)

             |    price price_hat
-------------+------------------
       price |   1.0000
   price_hat |   0.8712   1.0000

Test RMSE: 1823.451
R Output
OOB Prediction Error (MSE): 9.4235
Test R-squared: 0.8834
Test RMSE: 2.8721

Variable Importance

What Is Variable Importance For?

Variable importance answers the question: "Which features matter most for predictions?" After building a Random Forest or XGBoost model, you often want to know which variables are driving the results. For example, when predicting household consumption expenditure, is it income, household size, or education level that matters most? Variable importance rankings help you communicate results to stakeholders and identify the key drivers in your data. But remember: importance is about prediction, not causation.

There are two main measures. The first is MDI (Mean Decrease in Impurity, also called Gini importance): for each feature, it sums up the total reduction in the loss function across all splits that use that feature, averaged over all trees. Features that lead to large improvements in node purity are ranked as more important.

The second measure is permutation importance: for each feature, randomly shuffle its values across observations (breaking the relationship between the feature and the outcome) and measure how much the model's performance degrades. Permutation importance is generally considered more reliable than MDI because MDI can be biased toward high-cardinality features (features with many unique values tend to get artificially high MDI scores).
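
The following sketch (entirely synthetic data with made-up variable names) illustrates that bias: a pure-noise feature with many unique values tends to receive a visibly larger MDI score than a pure-noise binary feature, while permutation importance computed on held-out data drives both toward zero:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2_000
df = pd.DataFrame({
    "signal": rng.normal(size=n),                  # truly predictive
    "binary_noise": rng.integers(0, 2, size=n),    # noise, 2 unique values
    "highcard_noise": rng.uniform(size=n),         # noise, ~n unique values
})
y = 3 * df["signal"] + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(df, y, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("MDI:        ", dict(zip(df.columns, rf_demo.feature_importances_.round(3))))
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=0)
print("Permutation:", dict(zip(df.columns, perm.importances_mean.round(3))))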

A critical caveat: variable importance is not causal inference. A feature can rank high because it is a good predictor, not because it has a causal effect. For example, "number of fire trucks at a fire" is a strong predictor of property damage, but sending fewer trucks would obviously not reduce damage. Always interpret importance rankings as associations, not as evidence that changing a variable would change the outcome.

from sklearn.inspection import permutation_importance

# MDI importance (built into the fitted model)
mdi_imp = pd.Series(rf.feature_importances_, index=X.columns)
mdi_imp = mdi_imp.sort_values(ascending=False)
print("=== MDI (Gini) Importance ===")
print(mdi_imp.round(4))

# Permutation importance (more reliable)
perm_result = permutation_importance(
    rf, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)
perm_imp = pd.Series(perm_result.importances_mean, index=X.columns)
perm_imp = perm_imp.sort_values(ascending=False)
print("\n=== Permutation Importance ===")
print(perm_imp.round(4))

# Plot side-by-side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

mdi_imp.plot(kind='barh', ax=axes[0], color='steelblue')
axes[0].set_title('MDI (Gini) Importance')
axes[0].invert_yaxis()

perm_imp.plot(kind='barh', ax=axes[1], color='coral')
axes[1].set_title('Permutation Importance')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()
* Stata: Variable importance from rforest
* The rforest command stores importance in e() after estimation

* Re-run the forest (from previous code block)
rforest price mpg weight length foreign if train == 1, ///
    type(reg) iterations(500) numvars(2) depth(10) lsize(5)

* Display variable importance
ereturn list

* rforest reports importance as the mean decrease in MSE
* Note: permutation importance is not natively available
* in Stata's rforest. For permutation importance,
* use a Python block within Stata.

matrix list e(importance)
# R: Variable importance with ranger + vip package
library(vip)

# The ranger model already has permutation importance (set earlier)
# Extract and display
imp <- rf_model$variable.importance
imp_df <- data.frame(
  Variable = names(imp),
  Importance = as.numeric(imp)
)
imp_df <- imp_df[order(-imp_df$Importance), ]
print(imp_df, row.names = FALSE)

# Plot variable importance
vip(rf_model,
    num_features = 10,
    bar = TRUE,
    aesthetics = list(fill = "steelblue")
) + ggtitle("Permutation Importance (Random Forest)")
Python Output
=== MDI (Gini) Importance ===
MedInc        0.5218
AveOccup      0.1143
Latitude      0.0892
Longitude     0.0871
Population    0.0563
HouseAge      0.0534
AveRooms      0.0467
AveBedrms     0.0312

=== Permutation Importance ===
MedInc        0.5847
Latitude      0.1023
Longitude     0.0876
AveOccup      0.0765
HouseAge      0.0201
AveRooms      0.0187
Population    0.0098
AveBedrms     0.0045

Generated figures:

[Bar charts: "MDI (Gini) Importance" dominated by MedInc (0.52), and "Permutation Importance" led by MedInc (0.58), Latitude (0.10), and Longitude (0.09); values match the printed output above.]
Stata Output
e(importance)[1,4]
            mpg      weight     length    foreign
imp    2145.32    5893.21    3421.87    1234.56

Variable importance (mean decrease in MSE):
  weight:   5893.21
  length:   3421.87
  mpg:      2145.32
  foreign:  1234.56
R Output
 Variable Importance
    lstat   7.8341
       rm   6.1256
      dis   1.2043
     crim   1.1832
      nox   0.9845
  ptratio   0.8923
      age   0.4521
    indus   0.4102
      tax   0.3876
       zn   0.1234
     chas   0.0892
      rad   0.0654
        b   0.0543

[vip bar chart showing lstat and rm as top predictors]

Gradient Boosting & XGBoost

Analogy: A Team of Specialists Fixing Mistakes

Gradient Boosting is like a team of specialists where each one focuses on fixing the mistakes the previous ones made. The first specialist gives a rough estimate. The second one looks at where the first was wrong and corrects those errors. The third specialist corrects whatever errors remain. After 200 or 500 rounds of corrections, the team's combined answer is remarkably precise — even though each individual specialist was quite weak.

What Is XGBoost For?

XGBoost is the highest-accuracy method for structured/tabular data. It is the dominant algorithm in Kaggle competitions, production ML systems at tech companies, and increasingly in applied economics research. Use XGBoost when you need the best possible predictions and are willing to tune hyperparameters. The trade-off: it requires more careful configuration than a Random Forest, and it is a complete black box without SHAP values.

While Random Forests build many independent trees in parallel and average them, gradient boosting builds trees sequentially, where each new tree tries to correct the mistakes of the ensemble built so far. Start with a naive prediction (the mean of the outcome). Compute the residuals. Fit a small tree to those residuals. Add that tree's predictions (scaled by a learning rate) to the ensemble. Compute new residuals. Fit another tree. Repeat.

Each new tree is fitted to the residuals of the current ensemble. The learning rate (also called shrinkage) controls how much each tree contributes: a smaller learning rate means each tree has less influence, so you need more trees, but the model generalizes better.
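
The mechanics fit in a few lines. The sketch below uses synthetic data and shallow scikit-learn trees standing in for XGBoost's optimized learners, but it implements exactly the loop just described: start from the mean, fit a small tree to the current residuals, and add its prediction scaled by the learning rate.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=500)

learning_rate = 0.1
n_rounds = 200
prediction = np.full_like(y, y.mean())               # start from the mean
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                        # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * stump.predict(X)    # add a scaled correction
    trees.append(stump)

print(f"Training MSE after {n_rounds} rounds: {np.mean((y - prediction) ** 2):.4f}")
# to predict for new data: y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)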

The key conceptual difference: bagging (Random Forests) reduces variance by averaging many noisy trees. Boosting reduces bias by sequentially correcting errors. In practice, gradient boosting often achieves the highest predictive accuracy for tabular data — the kind of data economists typically work with.

XGBoost (eXtreme Gradient Boosting) is the most popular implementation. Developed by Tianqi Chen (first released in 2014; see Chen & Guestrin, 2016), it adds built-in L1 and L2 regularization, efficient handling of missing values, and a heavily optimized, parallelized training routine. It also supports column and row subsampling, introducing randomness that further helps prevent overfitting.

Economics example: When predicting county-level unemployment rates using Census data, XGBoost can capture complex interactions — for example, that the effect of educational attainment on unemployment depends on the local industry mix — without you having to specify these interactions manually. Researchers at the Federal Reserve and World Bank routinely use XGBoost for economic forecasting.

How Gradient Boosting Builds Sequentially

[Diagram: Tree 1 fits the data; Tree 2 fits the residuals of Tree 1; Tree 3 fits the residuals of Trees 1 + 2; ...; Tree N fixes the last errors, each contribution scaled by the learning rate (e.g., 0.1). Each tree is small and weak; errors shrink with each round, and early stopping prevents overfitting.]
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score

# Fit an XGBoost model
# (recent xgboost versions accept early_stopping_rounds in the constructor;
#  older versions expect it as an argument to .fit() instead)
xgb_model = xgb.XGBRegressor(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    reg_alpha=0.0,
    early_stopping_rounds=50,
    random_state=42
)

# Fit with early stopping on a held-out evaluation set
# (the test set is reused here for brevity; a separate validation
#  split is cleaner in practice)
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

# Evaluate
y_pred_xgb = xgb_model.predict(X_test)
print(f"Test R-squared: {r2_score(y_test, y_pred_xgb):.4f}")
print(f"Test RMSE:      {np.sqrt(mean_squared_error(y_test, y_pred_xgb)):.4f}")
print(f"Best iteration: {xgb_model.best_iteration}")

# Compare with Random Forest
print(f"\n--- Comparison ---")
print(f"Random Forest R2: {r2_score(y_test, y_pred):.4f}")
print(f"XGBoost R2:       {r2_score(y_test, y_pred_xgb):.4f}")
* Stata: XGBoost via Python block
* Native Stata does not have XGBoost; call Python directly

sysuse auto, clear
set seed 42
gen rand = runiform()
gen byte train = (rand <= 0.8)

* Using Python within Stata for XGBoost
python:
import xgboost as xgb
from sfi import Data
import numpy as np

# Get data from Stata
price  = np.array(Data.get("price"))
mpg    = np.array(Data.get("mpg"))
weight = np.array(Data.get("weight"))
length = np.array(Data.get("length"))
tr     = np.array(Data.get("train"))

X = np.column_stack([mpg, weight, length])
y = price
mask_tr = (tr == 1)

model = xgb.XGBRegressor(
    n_estimators=500, max_depth=6,
    learning_rate=0.1, subsample=0.8
)
model.fit(X[mask_tr], y[mask_tr])
pred = model.predict(X)

ss_res = np.sum((y[~mask_tr] - pred[~mask_tr])**2)
ss_tot = np.sum((y[~mask_tr] - y[~mask_tr].mean())**2)
r2 = 1 - ss_res / ss_tot
print(f"XGBoost Test R2: {r2:.4f}")
end
# R: XGBoost
library(xgboost)

# Prepare data in XGBoost's matrix format
# (data.matrix converts the factor column chas to numeric codes;
#  as.matrix would yield a character matrix and make xgb.DMatrix fail)
features <- data.matrix(train[, -which(names(train) == "medv")])
label    <- train$medv

dtrain <- xgb.DMatrix(data = features, label = label)
dtest  <- xgb.DMatrix(
  data = data.matrix(test[, -which(names(test) == "medv")]),
  label = test$medv
)

# Set parameters
params <- list(
  objective = "reg:squarederror",
  max_depth = 6,
  eta = 0.1,
  subsample = 0.8,
  colsample_bytree = 0.8
)

# Train with early stopping
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 500,
  watchlist = list(train = dtrain, test = dtest),
  early_stopping_rounds = 20,
  verbose = 0
)

# Evaluate
pred_xgb <- predict(xgb_model, dtest)
ss_res <- sum((test$medv - pred_xgb)^2)
ss_tot <- sum((test$medv - mean(test$medv))^2)
r2_xgb <- 1 - ss_res / ss_tot

cat("XGBoost Test R-squared:", round(r2_xgb, 4), "\n")
cat("Best iteration:", xgb_model$best_iteration, "\n")
Python Output
Test R-squared: 0.8389
Test RMSE:      0.4837
Best iteration: 247

--- Comparison ---
Random Forest R2: 0.8102
XGBoost R2:       0.8389
Stata Output
XGBoost Test R2: 0.8534
R Output
XGBoost Test R-squared: 0.8912
Best iteration: 189

Tuning Tree-Based Models

Tree-based models have several hyperparameters that control model complexity. The most important ones are:

  • n_estimators — the number of trees (more is generally better until you overfit; use early stopping)
  • max_depth — maximum depth of each tree (deeper = more complex; 3-8 for boosting, unlimited for RF)
  • learning_rate — for boosting: how much each tree contributes (smaller = more cautious, needs more trees)
  • min_samples_leaf — minimum observations in a leaf (prevents overfitting to tiny groups)
  • subsample / colsample_bytree — fraction of data/features used per tree (adds randomness)

The standard approach is random search with cross-validation: define a range of candidate values, randomly sample combinations, and evaluate each using k-fold CV. Empirical studies show that random search finds good configurations much faster than exhaustive grid search because it explores the hyperparameter space more broadly.

Practical rule of thumb for XGBoost: start with learning rate 0.05-0.1, max_depth between 3 and 8, subsample and colsample_bytree around 0.7-0.9, and rely on early stopping to determine the number of trees. Then fine-tune from there.

from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# Define the hyperparameter search space
param_grid = {
    'n_estimators': [100, 300, 500, 800],
    'max_depth': [3, 4, 5, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 3, 5, 10],
}

# Random search with 5-fold cross-validation
search = RandomizedSearchCV(
    xgb.XGBRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=50,
    cv=5,
    scoring='neg_mean_squared_error',
    random_state=42,
    n_jobs=-1,
    verbose=0
)
search.fit(X_train, y_train)

# Best parameters and score
print("Best parameters:")
for param, value in search.best_params_.items():
    print(f"  {param}: {value}")

best_rmse = (-search.best_score_) ** 0.5
print(f"\nBest CV RMSE: {best_rmse:.4f}")

# Evaluate best model on test set
y_pred_best = search.best_estimator_.predict(X_test)
print(f"Test R-squared: {r2_score(y_test, y_pred_best):.4f}")
* Stata: Hyperparameter tuning via Python block
* Native Stata does not support grid/random search for ML models.

python:
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb
import numpy as np
from sfi import Data

# Get data from Stata
price  = np.array(Data.get("price"))
mpg    = np.array(Data.get("mpg"))
weight = np.array(Data.get("weight"))
length = np.array(Data.get("length"))
tr     = np.array(Data.get("train"))

X = np.column_stack([mpg, weight, length])
y = price
mask = (tr == 1)

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 500],
    'subsample': [0.7, 0.8, 0.9],
}

search = RandomizedSearchCV(
    xgb.XGBRegressor(random_state=42),
    param_grid, n_iter=20, cv=5,
    scoring='neg_mean_squared_error', random_state=42
)
search.fit(X[mask], y[mask])
print(f"Best params: {search.best_params_}")
print(f"Best CV RMSE: {(-search.best_score_)**0.5:.2f}")
end
# R: Hyperparameter tuning for XGBoost with caret
library(caret)

# Define grid of hyperparameters
xgb_grid <- expand.grid(
  nrounds = c(100, 300, 500),
  max_depth = c(3, 5, 7),
  eta = c(0.01, 0.05, 0.1),
  gamma = 0,
  colsample_bytree = c(0.7, 0.8),
  min_child_weight = c(1, 5),
  subsample = 0.8
)

# Train with 5-fold cross-validation
ctrl <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = FALSE
)

# Random search (select subset of grid)
set.seed(42)
xgb_tuned <- train(
  medv ~ ., data = train,
  method = "xgbTree",
  trControl = ctrl,
  tuneGrid = xgb_grid[sample(nrow(xgb_grid), 30), ],
  verbose = FALSE
)

# Best parameters
cat("Best parameters:\n")
print(xgb_tuned$bestTune)

# Evaluate on test set
pred_tuned <- predict(xgb_tuned, newdata = test)
r2_tuned <- 1 - sum((test$medv - pred_tuned)^2) /
                 sum((test$medv - mean(test$medv))^2)
cat("Test R-squared:", round(r2_tuned, 4), "\n")
Python Output
Best parameters:
  subsample: 0.8
  n_estimators: 500
  min_child_weight: 3
  max_depth: 5
  learning_rate: 0.05
  colsample_bytree: 0.8

Best CV RMSE: 0.4712
Test R-squared: 0.8456
Stata Output
Best params: {'subsample': 0.8, 'n_estimators': 300, 'max_depth': 5, 'learning_rate': 0.05}
Best CV RMSE: 1756.32
R Output
Best parameters:
  nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
      300         5 0.05     0              0.8                5       0.8

Test R-squared: 0.9023

Interpreting Predictions with SHAP

What Are SHAP Values For?

Variable importance tells you which features matter globally. SHAP values go deeper: they explain how each feature affects each individual prediction, and in which direction. If your XGBoost model predicts that a specific household will default on a loan, SHAP tells you why: maybe high debt-to-income ratio pushed the prediction up by 15 percentage points, while long credit history pulled it down by 8 points. SHAP makes black-box models auditable and interpretable — essential for communicating ML results to policymakers, regulators, and non-technical stakeholders.

SHAP values (SHapley Additive exPlanations), developed by Lundberg and Lee in 2017, are based on Shapley values from cooperative game theory. The core idea: imagine a house is predicted to cost $450,000, while the average prediction is $350,000. SHAP assigns a portion of that $100,000 difference to each feature: maybe "median income in the neighborhood" contributes +$70,000, "proximity to the coast" contributes +$40,000, and "house age" contributes -$10,000. These contributions always sum to exactly the difference between the prediction and the baseline (this is a mathematical guarantee).
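
That additivity guarantee is easy to check directly. The sketch below (a toy synthetic dataset and a small XGBoost model, not the housing model used elsewhere in this module) verifies that, for every observation, the explainer's expected value plus the observation's SHAP values reproduces the model's prediction:

import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(0, 0.1, size=500)

model_demo = xgb.XGBRegressor(n_estimators=100, max_depth=3).fit(X_demo, y_demo)
expl = shap.TreeExplainer(model_demo)
sv = expl.shap_values(X_demo)                 # one row of contributions per observation

lhs = expl.expected_value + sv.sum(axis=1)    # baseline + contributions
rhs = model_demo.predict(X_demo)              # actual predictions
print("Additivity holds:", np.allclose(lhs, rhs, atol=1e-3))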

The most common SHAP visualization is the summary plot (beeswarm plot): for each feature, it shows a dot for every observation, colored by the feature value (red = high, blue = low) and positioned along the x-axis according to the SHAP value (positive = pushes prediction up, negative = pushes it down). This gives you both the importance and the direction of each feature's effect in a single plot.

Economics example: In a model predicting household poverty using survey data, SHAP values can reveal that having a female household head increases the predicted poverty probability by 5 percentage points for urban households but only 2 points for rural ones — uncovering heterogeneity that a simple variable importance ranking would miss entirely.

import shap

# Use the fitted XGBoost model from earlier
# TreeExplainer is optimized for tree-based models
explainer = shap.TreeExplainer(xgb_model)

# Compute SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Summary plot (beeswarm): shows feature importance + direction
shap.summary_plot(shap_values, X_test, show=False)
plt.title("SHAP Summary Plot")
plt.tight_layout()
plt.show()

# Force plot for a single prediction (observation 0)
print(f"\nPrediction for observation 0: {xgb_model.predict(X_test.iloc[[0]])[0]:.3f}")
print(f"Base value (mean prediction): {explainer.expected_value:.3f}")

# Show top SHAP contributors for this observation
obs_shap = pd.Series(shap_values[0], index=X.columns)
obs_shap = obs_shap.reindex(obs_shap.abs().sort_values(ascending=False).index)
print("\nSHAP values for observation 0:")
print(obs_shap.round(4))
* Stata: SHAP values via Python block
* Native Stata does not have SHAP support.

python:
import shap
import xgboost as xgb
import numpy as np
from sfi import Data

# Get data (reusing variables from earlier)
price  = np.array(Data.get("price"))
mpg    = np.array(Data.get("mpg"))
weight = np.array(Data.get("weight"))
length = np.array(Data.get("length"))
tr     = np.array(Data.get("train"))

X = np.column_stack([mpg, weight, length])
y = price
mask = (tr == 1)

# Fit XGBoost
model = xgb.XGBRegressor(n_estimators=300, max_depth=5, learning_rate=0.1)
model.fit(X[mask], y[mask])

# Compute SHAP values
explainer = shap.TreeExplainer(model)
shap_vals = explainer.shap_values(X[~mask])

# Display mean absolute SHAP values
feature_names = ["mpg", "weight", "length"]
mean_shap = np.abs(shap_vals).mean(axis=0)
for name, val in sorted(zip(feature_names, mean_shap), key=lambda x: -x[1]):
    print(f"  {name:10s}: {val:.2f}")
end
# R: SHAP values with the shapviz package
library(shapviz)

# Feature matrix (no outcome column); data.matrix converts the factor chas
X_test_mat <- data.matrix(test[, -which(names(test) == "medv")])

# Create shapviz object (computes SHAP values internally)
shap_obj <- shapviz(xgb_model, X_pred = xgb.DMatrix(X_test_mat, label = test$medv))

# Beeswarm plot (SHAP summary)
sv_importance(shap_obj, kind = "beeswarm") +
  ggtitle("SHAP Beeswarm Plot (XGBoost)")

# Mean absolute SHAP values (global importance)
sv_importance(shap_obj, kind = "bar") +
  ggtitle("Mean |SHAP| Values")
Python Output
Prediction for observation 0: 0.478
Base value (mean prediction): 2.069

SHAP values for observation 0:
MedInc       -1.2341
AveOccup     -0.3412
Latitude      0.1523
AveBedrms    -0.0987
Longitude     0.0934
AveRooms     -0.0876
HouseAge     -0.0543
Population   -0.0208

Generated figures:

[SHAP summary (beeswarm) plot: features ordered MedInc, AveOccup, Latitude, Longitude, HouseAge, AveRooms, Population from top to bottom; each dot is one observation, placed by its SHAP value (roughly -2 to +2 on the x-axis, "impact on model output") and colored from low to high feature value.]
Stata Output
Mean |SHAP| values:
  weight    : 1523.45
  mpg       :  892.31
  length    :  456.78
R Output
[shapviz beeswarm plot displayed with features ranked by importance.
 lstat and rm show the widest SHAP value spreads.
 High lstat values push predictions down; high rm values push them up.]

[shapviz bar plot displayed showing mean |SHAP| values:
 lstat: 3.12, rm: 2.45, dis: 0.54, crim: 0.41, nox: 0.39, ...]

Trees vs. Regularization

Both tree-based methods and regularized regression (LASSO, Ridge, Elastic Net from Module 11A) are powerful prediction tools, but they have different strengths. The choice depends on the nature of your data and problem.

Tree-Based Methods (RF, XGBoost)

  • Automatically capture nonlinearities and interactions
  • No need for feature engineering (polynomials, interactions)
  • Robust to outliers and skewed distributions
  • Handle mixed data types (numeric + categorical) naturally
  • Do not require feature scaling or normalization
  • Best for: complex nonlinear relationships; many features with interactions; tabular data where accuracy matters most

Regularized Regression (LASSO, Ridge)

  • Produce interpretable, sparse models (especially LASSO)
  • Work well when the true relationship is approximately linear
  • Computationally faster to train and tune
  • LASSO performs automatic variable selection
  • Better with small sample sizes (fewer parameters to estimate)
  • Best for: inference-adjacent tasks; feature selection; linear or mildly nonlinear settings; small n

When to Use What: A Quick Decision Guide

Nonlinear relationships? If no, use LASSO / Ridge (simple, interpretable). If yes: do you need maximum accuracy? If no, use a Random Forest; if yes, use XGBoost.

This is a rough guide. In practice, try both approaches and compare via cross-validation.
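
As a concrete way to follow that advice, here is a minimal sketch (California housing data again; the specific settings are arbitrary defaults) that scores a LASSO pipeline and a Random Forest on the same cross-validation folds:

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)

# LASSO needs standardized features; trees do not
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42))
forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=5,
                               random_state=42, n_jobs=-1)

for name, model in [("LASSO", lasso), ("Random Forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2", n_jobs=-1)
    print(f"{name:14s} 5-fold CV R-squared: {scores.mean():.3f}")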

Limitations

Key Limitations of Tree-Based Methods

  • Single trees are unstable: Small changes in the training data can produce a completely different tree structure. This is why ensembles (Random Forests, boosting) are almost always preferred over single trees in practice.
  • Trees cannot extrapolate: Because predictions are averages of training observations in each leaf, trees can never predict values outside the range seen during training. If your training data has house prices between $100K and $1M, the model will never predict $1.5M. This is a fundamental limitation compared to linear models; a short sketch after this list illustrates the point.
  • Computational cost: Large Random Forests and XGBoost models can be slow to train on very large datasets, though parallelization helps. XGBoost is significantly faster than a naive implementation but still requires more computation than a LASSO regression.
  • Interpretability trade-off: While single trees are highly interpretable, ensembles of hundreds of trees are black boxes. SHAP values and variable importance help, but they do not provide the simple, transparent model that a policymaker or regulator might want.
  • Not designed for causal inference: Tree-based methods are prediction machines. High variable importance does not imply causation. For causal questions, use the methods from Module 6 (or see Module 11D: Causal ML for hybrid approaches like causal forests).
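
To illustrate the extrapolation point from the list above, the following sketch (synthetic data; exact numbers will vary) trains a tree and a linear regression on x between 0 and 10 with y ≈ 2x, then asks both to predict at x = 20. The tree is capped near the largest leaf mean (about 20), while the linear model extrapolates to about 40:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_tr = rng.uniform(0, 10, size=(300, 1))
y_tr = 2 * X_tr[:, 0] + rng.normal(0, 0.5, size=300)   # true line: y = 2x

tree_m = DecisionTreeRegressor(max_depth=6).fit(X_tr, y_tr)
lin_m = LinearRegression().fit(X_tr, y_tr)

X_new = np.array([[20.0]])                              # outside the training range
print(f"Tree prediction at x=20:   {tree_m.predict(X_new)[0]:6.2f}")   # capped near 20
print(f"Linear prediction at x=20: {lin_m.predict(X_new)[0]:6.2f}")    # roughly 40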

References

  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapters 9, 10, 15.
  • Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems (NeurIPS).