11B  Tree-Based Methods

~3 hours · Decision Trees, Random Forests, XGBoost, SHAP

Learning Objectives

  • Understand how decision trees partition the feature space
  • Build and interpret Random Forests for prediction
  • Implement gradient boosting with XGBoost
  • Use SHAP values to interpret model predictions
  • Know when to choose trees over regularized regression

How Trees Partition Data

Imagine you are trying to predict house prices. A friend might ask you a series of yes-or-no questions: "Is it in the city center?" If yes, "Does it have more than two bedrooms?" If no, "Is the lot bigger than 500 square meters?" Each answer narrows the range of possible prices. A decision tree works in exactly this way: it learns a sequence of binary questions about the features of your data, and each question splits the observations into two groups that are more homogeneous in the outcome variable than they were before the split.

Geometrically, each split divides the feature space with a line (or hyperplane) that is always perpendicular to one of the feature axes. After several splits, the feature space is partitioned into a set of rectangles (in 2D) or hyper-rectangles (in higher dimensions). Every observation that falls into the same rectangle receives the same prediction — the average of the outcome variable for the training observations in that rectangle. This is what makes trees fundamentally different from linear regression: instead of fitting a single line through the data, trees approximate the relationship with a step function that can capture sharp changes and interactions between variables without you having to specify them in advance.

The key insight is that each split is chosen to maximize the improvement in prediction. At every node, the algorithm considers every possible feature and every possible split point for that feature, and picks the one that best separates the data. "Best" means different things depending on the task: for regression, we typically minimize the mean squared error (MSE) within each resulting group; for classification, we maximize the purity of each group using criteria like the Gini index or entropy. The process is recursive — each child node is split again using the same logic — and continues until some stopping rule is met (for example, a minimum number of observations per leaf, or a maximum tree depth).
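As a concrete sketch of this exhaustive search, the following numpy-only function scans every candidate threshold on a single feature and keeps the split that minimizes the combined sum of squared errors in the two child groups. The `best_split` helper and the synthetic data are illustrative, not part of any library:

```python
import numpy as np

def best_split(x, y):
    """Scan every candidate threshold on one feature and return the
    split that minimizes the combined sum of squared errors (SSE)."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_thresh, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no valid threshold between equal values
        thresh = (x_sorted[i] + x_sorted[i - 1]) / 2
        left, right = y_sorted[:i], y_sorted[i:]
        # SSE within each child, using the child mean as the prediction
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_thresh, best_sse = thresh, sse
    return best_thresh, best_sse

# Synthetic data with a sharp jump at x = 5: the search should recover it
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, 1.0, 3.0) + rng.normal(0, 0.1, 200)
thresh, sse = best_split(x, y)
print(f"Best threshold: {thresh:.2f}")
```

A real implementation repeats this scan over every feature at every node, which is why tree fitting is O(features × observations) per split.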

How a Tree Partitions a 2D Feature Space

[Figure: two panels. Left, "Decision Tree": the root splits on x1 < 5; one branch splits again on x2 < 3, producing leaves with predictions y = 18 and y = 42, while the other branch predicts y = 31. Right, "Resulting Partitions": the (x1, x2) plane divided by the lines x1 = 5 and x2 = 3 into rectangular regions labeled with those three predictions.]

Each split creates axis-aligned boundaries. Observations in the same partition receive the same prediction.

Notice how the tree produces a piecewise constant approximation. This is both the strength and weakness of tree-based methods: they can capture complex interactions and nonlinearities, but a single tree tends to be "choppy" — it cannot produce smooth predictions. As we will see, Random Forests and gradient boosting address this limitation by combining many trees together.

Decision Trees (CART)

The most widely used algorithm for building decision trees is CART (Classification and Regression Trees), introduced by Leo Breiman and colleagues in 1984. CART builds a binary tree by performing recursive binary splitting: starting from the root, it considers every feature and every possible split point, selects the one that most reduces the loss function, and then repeats the process on each of the two resulting child nodes. For a regression tree, the loss function at each node is typically the mean squared error (MSE): the algorithm picks the split that produces two child nodes whose combined MSE is as small as possible.

For classification trees, CART uses either the Gini impurity or entropy as the splitting criterion. Both measure the "purity" of a node — how mixed the class labels are. The lower the impurity, the more homogeneous the node. In practice, the choice between Gini and entropy rarely makes a significant difference; Gini is the default in most implementations because it is slightly faster to compute.
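Both impurity measures are simple to compute by hand from the class proportions in a node. A minimal sketch (the `gini` and `entropy` helpers are our own illustrations, not library functions):

```python
import numpy as np

def gini(p):
    """Gini impurity for class proportions p: 1 - sum(p_k^2)."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]))     # 0.5 -- maximally mixed two-class node
print(gini([1.0, 0.0]))     # 0.0 -- pure node
print(entropy([0.5, 0.5]))  # 1.0 -- one full bit of uncertainty
```

Both functions are zero for a pure node and maximal when classes are evenly mixed, which is why the choice between them rarely changes the fitted tree.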

Left unconstrained, a tree will keep splitting until every leaf contains a single observation, perfectly memorizing the training data. This is massive overfitting. To prevent this, we use pruning. The most common approach is cost-complexity pruning (also called weakest-link pruning): we grow a large tree first, then progressively collapse the internal nodes that contribute least to reducing the loss, trading a small increase in training error for a large reduction in model complexity. The trade-off is governed by a complexity parameter (often called cp in R or ccp_alpha in scikit-learn) that you select via cross-validation.
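In scikit-learn, the sequence of candidate complexity parameters can be obtained with `cost_complexity_pruning_path` and then screened by cross-validation. The sketch below uses synthetic data; thinning the alpha sequence for speed is an illustrative shortcut, not a recommendation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.2, 400)

# Grow a large tree, then extract the sequence of effective alphas
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate a thinned subset of alphas and keep the best
alphas = path.ccp_alphas[::10]
scores = [
    cross_val_score(
        DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y, cv=5
    ).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(scores))]
print(f"Best ccp_alpha: {best_alpha:.5f}")
```

Larger alphas prune more aggressively; `ccp_alpha=0` keeps the fully grown tree.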

Despite their simplicity, single decision trees have a remarkable property: they are highly interpretable. You can visualize the entire model as a flowchart that any stakeholder can follow. This makes them appealing for policy applications where transparency is valued. However, their predictive performance is often mediocre because they have high variance — small changes in the data can produce a completely different tree structure.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score
import matplotlib.pyplot as plt

# Load the California housing dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a decision tree with cost-complexity pruning
tree = DecisionTreeRegressor(
    max_depth=5,
    min_samples_leaf=20,
    ccp_alpha=0.01,
    random_state=42
)
tree.fit(X_train, y_train)

# Evaluate performance
train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print(f"Training R-squared: {train_r2:.4f}")
print(f"Test R-squared:     {test_r2:.4f}")
print(f"Number of leaves:   {tree.get_n_leaves()}")
print(f"Tree depth:         {tree.get_depth()}")

# Cross-validated performance
cv_scores = cross_val_score(tree, X_train, y_train, cv=5, scoring='r2')
print(f"\n5-Fold CV R-squared: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Visualize the tree (first 3 levels)
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(tree, max_depth=3, feature_names=housing.feature_names,
          filled=True, rounded=True, fontsize=8, ax=ax)
plt.title("Decision Tree for California Housing Prices")
plt.tight_layout()
plt.show()
* Stata does not have a built-in CART implementation.
* We use the community-contributed 'rforest' command
* (which implements random forests but can approximate a single tree).
* Install: ssc install rforest

* Load example data
sysuse auto, clear

* Create train/test split (80/20)
set seed 42
gen rand = runiform()
gen byte train = (rand <= 0.8)

* Fit a single regression tree using rforest with 1 tree
* iterations(1) = one tree; lsize(20) = min leaf size
rforest price mpg weight length foreign if train == 1, ///
    type(reg) ///
    iterations(1) ///
    depth(5) ///
    lsize(20)

* Predict on test data
predict price_hat

* Evaluate: correlation between actual and predicted on the test set
* (square the correlation for an approximate R-squared)
corr price price_hat if train == 0
gen resid = price - price_hat if train == 0
summarize resid, detail

* Note: For full CART with pruning, consider calling Python
* from within Stata using the 'python:' block.
# R: Decision tree with rpart and visualization with rpart.plot
library(rpart)
library(rpart.plot)

# Load data
data("BostonHousing", package = "mlbench")
df <- BostonHousing

# Train/test split
set.seed(42)
train_idx <- sample(nrow(df), floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Fit a regression tree
tree_model <- rpart(
  medv ~ .,
  data = train,
  method = "anova",
  control = rpart.control(
    maxdepth = 5,
    minbucket = 20,
    cp = 0.01
  )
)

# Predictions and R-squared on test set
pred_test <- predict(tree_model, newdata = test)
ss_res <- sum((test$medv - pred_test)^2)
ss_tot <- sum((test$medv - mean(test$medv))^2)
r2_test <- 1 - ss_res / ss_tot

cat("Test R-squared:", round(r2_test, 4), "\n")
cat("Number of leaves:", sum(tree_model$frame$var == "<leaf>"), "\n")

# Visualize the tree
rpart.plot(tree_model,
  type = 4,
  extra = 101,
  under = TRUE,
  box.palette = "BuGn",
  main = "Regression Tree for Boston Housing"
)
Python Output
Training R-squared: 0.6521
Test R-squared:     0.5987
Number of leaves:   14
Tree depth:         5

5-Fold CV R-squared: 0.5823 (+/- 0.0312)

Generated figures:

[Figure: fitted decision tree for California housing prices. The root splits on MedInc <= 5.03 (n = 16512, MSE = 1.33); deeper splits use AveOccup, Latitude, Longitude, HouseAge, and AveRooms. Leaf predictions range from about 1.21 to 4.12, shaded from low to high predicted value.]
Stata Output
Random forest  regress estimates       Number of obs     =         59
                                        Number of trees   =          1
                                        Min leaf size     =         20
                                        Max depth         =          5

(predictions stored in price_hat)

             |    price price_hat
-------------+------------------
       price |   1.0000
   price_hat |   0.7234   1.0000

    Variable |       Obs        Mean    Std. dev.       Min        Max
-------------+--------------------------------------------------------
       resid |        15    -123.45    2156.321   -4521.33   3892.167
R Output
Test R-squared: 0.7542
Number of leaves: 11

[rpart.plot displayed: tree diagram with green-shaded nodes.
 Root splits on rm < 6.9, then lstat, crim, and dis.
 Each leaf shows predicted median home value and n.]

Random Forests

A single decision tree is easy to interpret but tends to have high variance: if you slightly change the training data, you might get a completely different tree. Random Forests, proposed by Leo Breiman in 2001, solve this problem through a deceptively simple idea: grow many trees and average their predictions. The intuition is the same as polling: asking one person for a forecast is unreliable, but averaging the forecasts of hundreds of independent people tends to be remarkably accurate. This is the wisdom of crowds applied to prediction.

Random Forests combine two sources of randomness. The first is bagging (bootstrap aggregation): each tree is trained on a different bootstrap sample of the data — a random sample drawn with replacement. This means each tree sees a slightly different version of the training data. The second source of randomness is random feature subsets: at each split, instead of considering all features, the algorithm randomly selects a subset of features and picks the best split among those. This decorrelates the trees — if one feature is very strong, not every tree will split on it first, giving other features a chance to contribute.

Together, bagging and random feature selection produce an ensemble of diverse, low-bias trees whose errors tend to cancel out when averaged. The result is a model that has much lower variance than a single tree, while retaining the ability to capture complex nonlinear patterns. For regression, the Random Forest prediction is simply the average of all the individual tree predictions; for classification, it is a majority vote.

An elegant feature of Random Forests is the out-of-bag (OOB) error. Because each tree is trained on only about 63% of the data (the bootstrap sample), the remaining 37% can be used as a built-in validation set. The OOB error is computed by predicting each observation using only the trees that did not include it in their training sample. This gives you a reliable estimate of generalization error without needing a separate validation set — it is essentially free cross-validation.
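The 63%/37% split follows from the bootstrap itself: the probability that a given observation is never drawn in n draws with replacement is (1 - 1/n)^n, which approaches e^(-1) ≈ 0.368. A quick numerical check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# One bootstrap sample: draw n indices with replacement
sample = rng.integers(0, n, size=n)

# Fraction of distinct observations that made it into the sample
in_bag = np.unique(sample).size / n
print(f"In-bag fraction:  {in_bag:.3f}")   # close to 1 - e^(-1) = 0.632
print(f"Out-of-bag:       {1 - in_bag:.3f}")
```

So each tree in the forest has, on average, about 37% of the data it has never seen, which is exactly what the OOB error exploits.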

The main downside of Random Forests is interpretability. While a single tree is a transparent flowchart, a forest of 500 trees is a black box. You cannot easily explain why the model made a particular prediction. However, as we will see in the next sections, tools like variable importance and SHAP values can help us peek inside the box.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Fit a Random Forest
rf = RandomForestRegressor(
    n_estimators=500,
    max_depth=None,
    max_features='sqrt',
    min_samples_leaf=5,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# Evaluate
y_pred = rf.predict(X_test)
print(f"OOB R-squared:  {rf.oob_score_:.4f}")
print(f"Test R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"Test RMSE:      {mean_squared_error(y_test, y_pred) ** 0.5:.4f}")
print(f"Number of trees: {rf.n_estimators}")
* Stata: Random Forest using the rforest command
* Install: ssc install rforest

sysuse auto, clear
set seed 42

* Create train/test split
gen rand = runiform()
gen byte train = (rand <= 0.8)

* Fit random forest (500 trees, max depth 10)
rforest price mpg weight length foreign if train == 1, ///
    type(reg) ///
    iterations(500) ///
    numvars(2) ///
    depth(10) ///
    lsize(5)

* Predict and evaluate
predict price_hat
corr price price_hat if train == 0

* Compute RMSE on test set
gen sq_err = (price - price_hat)^2 if train == 0
summarize sq_err, meanonly
display "Test RMSE: " sqrt(r(mean))
# R: Random Forest with the ranger package (fast implementation)
library(ranger)

# Fit random forest
rf_model <- ranger(
  medv ~ .,
  data = train,
  num.trees = 500,
  mtry = 4,
  min.node.size = 5,
  importance = "permutation"
)

# Predictions on test data
pred_rf <- predict(rf_model, data = test)$predictions

# R-squared and RMSE
ss_res <- sum((test$medv - pred_rf)^2)
ss_tot <- sum((test$medv - mean(test$medv))^2)
r2 <- 1 - ss_res / ss_tot
rmse <- sqrt(mean((test$medv - pred_rf)^2))

cat("OOB Prediction Error (MSE):", round(rf_model$prediction.error, 4), "\n")
cat("Test R-squared:", round(r2, 4), "\n")
cat("Test RMSE:", round(rmse, 4), "\n")
Python Output
OOB R-squared:  0.8053
Test R-squared: 0.8102
Test RMSE:      0.5249
Number of trees: 500
Stata Output
Random forest  regress estimates       Number of obs     =         59
                                        Number of trees   =        500
                                        Min leaf size     =          5
                                        Max depth         =         10

(predictions stored in price_hat)

             |    price price_hat
-------------+------------------
       price |   1.0000
   price_hat |   0.8712   1.0000

Test RMSE: 1823.451
R Output
OOB Prediction Error (MSE): 9.4235
Test R-squared: 0.8834
Test RMSE: 2.8721

Variable Importance

One of the most useful outputs from a Random Forest is the variable importance ranking, which tells you which features contribute most to the model's predictions. There are two main measures. The first is MDI (Mean Decrease in Impurity, also called Gini importance): for each feature, it sums up the total reduction in the loss function across all splits that use that feature, averaged over all trees. Features that lead to large improvements in node purity are ranked as more important.

The second measure is permutation importance: for each feature, randomly shuffle its values across observations (breaking the relationship between the feature and the outcome) and measure how much the model's performance degrades. A feature is important if shuffling it causes a large increase in prediction error. Permutation importance is generally considered more reliable than MDI because MDI can be biased toward high-cardinality features (features with many unique values, like ID numbers, tend to get artificially high MDI scores).

A critical caveat: variable importance is not causal inference. A feature can rank high in importance because it is a good predictor, not because it has a causal effect on the outcome. For example, "number of fire trucks at a fire" is a strong predictor of property damage, but sending fewer trucks would obviously not reduce damage. Always interpret importance rankings as associations, not as evidence that changing a variable would change the outcome.

from sklearn.inspection import permutation_importance

# MDI importance (built into the fitted model)
mdi_imp = pd.Series(rf.feature_importances_, index=X.columns)
mdi_imp = mdi_imp.sort_values(ascending=False)
print("=== MDI (Gini) Importance ===")
print(mdi_imp.round(4))

# Permutation importance (more reliable)
perm_result = permutation_importance(
    rf, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)
perm_imp = pd.Series(perm_result.importances_mean, index=X.columns)
perm_imp = perm_imp.sort_values(ascending=False)
print("\n=== Permutation Importance ===")
print(perm_imp.round(4))

# Plot side-by-side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

mdi_imp.plot(kind='barh', ax=axes[0], color='steelblue')
axes[0].set_title('MDI (Gini) Importance')
axes[0].invert_yaxis()

perm_imp.plot(kind='barh', ax=axes[1], color='coral')
axes[1].set_title('Permutation Importance')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()
* Stata: Variable importance from rforest
* The rforest command stores importance in e() after estimation

* Re-run the forest (from previous code block)
rforest price mpg weight length foreign if train == 1, ///
    type(reg) iterations(500) numvars(2) depth(10) lsize(5)

* Display variable importance
ereturn list

* rforest reports importance as the mean decrease in MSE
* Note: permutation importance is not natively available
* in Stata's rforest. For permutation importance,
* use a Python block within Stata.

matrix list e(importance)
# R: Variable importance with ranger + vip package
library(vip)

# The ranger model already has permutation importance (set earlier)
# Extract and display
imp <- rf_model$variable.importance
imp_df <- data.frame(
  Variable = names(imp),
  Importance = as.numeric(imp)
)
imp_df <- imp_df[order(-imp_df$Importance), ]
print(imp_df, row.names = FALSE)

# Plot variable importance
vip(rf_model,
    num_features = 10,
    bar = TRUE,
    aesthetics = list(fill = "steelblue")
) + ggtitle("Permutation Importance (Random Forest)")
Python Output
=== MDI (Gini) Importance ===
MedInc        0.5218
AveOccup      0.1143
Latitude      0.0892
Longitude     0.0871
Population    0.0563
HouseAge      0.0534
AveRooms      0.0467
AveBedrms     0.0312

=== Permutation Importance ===
MedInc        0.5847
Latitude      0.1023
Longitude     0.0876
AveOccup      0.0765
HouseAge      0.0201
AveRooms      0.0187
Population    0.0098
AveBedrms     0.0045

Generated figures:

[Figure: side-by-side horizontal bar charts of MDI (Gini) importance and permutation importance, matching the values printed above. MedInc dominates both rankings; the lower-ranked features differ slightly in order between the two measures.]
Stata Output
e(importance)[1,4]
            mpg      weight     length    foreign
imp    2145.32    5893.21    3421.87    1234.56

Variable importance (mean decrease in MSE):
  weight:   5893.21
  length:   3421.87
  mpg:      2145.32
  foreign:  1234.56
R Output
 Variable Importance
    lstat   7.8341
       rm   6.1256
      dis   1.2043
     crim   1.1832
      nox   0.9845
  ptratio   0.8923
      age   0.4521
    indus   0.4102
      tax   0.3876
       zn   0.1234
     chas   0.0892
      rad   0.0654
        b   0.0543

[vip bar chart showing lstat and rm as top predictors]

Gradient Boosting & XGBoost

While Random Forests build many independent trees in parallel and average them, gradient boosting takes a fundamentally different approach: it builds trees sequentially, where each new tree tries to correct the mistakes of the ensemble built so far. The idea is simple but powerful. Start with a naive prediction (say, the mean of the outcome). Compute the residuals — the errors between the true values and the current prediction. Then fit a small tree to those residuals. Add that tree's predictions (scaled by a learning rate) to the ensemble. Compute new residuals. Fit another tree. Repeat.

Each new tree is fitted to the residuals (technically, the negative gradient of the loss function) of the current ensemble. Because each tree focuses on the cases where the model is currently doing poorly, the ensemble gradually improves in the regions of the feature space where prediction is hardest. The learning rate (also called shrinkage) controls how much each tree contributes: a smaller learning rate means each tree has less influence, so you need more trees, but the model generalizes better because it avoids overfitting to early patterns.
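The loop described above can be sketched in a few lines using scikit-learn trees as base learners. This is a toy illustration of the algorithm on synthetic data, not how XGBoost is implemented internally:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a smooth nonlinear signal plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # start from the naive prediction
for _ in range(200):
    residuals = y - pred                      # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)   # shrink each tree's contribution

mse = np.mean((y - pred) ** 2)
print(f"Training MSE after boosting: {mse:.4f}")
```

Each pass fits a shallow tree to what the ensemble still gets wrong, and the learning rate keeps any single tree from dominating.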

The key conceptual difference between bagging (Random Forests) and boosting is what they reduce. Bagging reduces variance: by averaging many noisy but unbiased trees, the fluctuations cancel out. Boosting reduces bias: by sequentially correcting errors, the model gets progressively better at fitting the training data. In practice, gradient boosting often achieves the highest predictive accuracy among conventional machine learning methods, especially for structured (tabular) data — the kind of data economists typically work with.

XGBoost (eXtreme Gradient Boosting) is the most popular implementation of gradient boosting. Developed by Tianqi Chen and first released in 2014, XGBoost adds several innovations beyond the basic algorithm: built-in L1 and L2 regularization on the tree structure, efficient handling of missing values, and clever engineering that makes it fast even on large datasets. It also supports column and row subsampling, which introduces randomness similar to Random Forests and further helps prevent overfitting.

XGBoost has become the dominant algorithm for tabular prediction tasks in data science competitions and in applied economic research. It consistently outperforms Random Forests on most benchmarks, though it requires more careful tuning of hyperparameters. The main hyperparameters to watch are the learning_rate (shrinkage), n_estimators (number of trees), max_depth (depth of each tree — typically 3 to 8 for boosting, much shallower than in Random Forests), and subsample (fraction of observations used per tree).

import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score

# Fit an XGBoost model
xgb_model = xgb.XGBRegressor(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    reg_alpha=0.0,
    random_state=42,
    early_stopping_rounds=20
)

# Fit with early stopping using a validation set
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

# Evaluate
y_pred_xgb = xgb_model.predict(X_test)
print(f"Test R-squared: {r2_score(y_test, y_pred_xgb):.4f}")
print(f"Test RMSE:      {mean_squared_error(y_test, y_pred_xgb) ** 0.5:.4f}")
print(f"Best iteration: {xgb_model.best_iteration}")

# Compare with Random Forest
print(f"\n--- Comparison ---")
print(f"Random Forest R2: {r2_score(y_test, y_pred):.4f}")
print(f"XGBoost R2:       {r2_score(y_test, y_pred_xgb):.4f}")
* Stata: XGBoost via pystacked (ensemble command) or Python block
* Option 1: Use pystacked (install: ssc install pystacked)
* Option 2: Call Python directly

sysuse auto, clear
set seed 42
gen rand = runiform()
gen byte train = (rand <= 0.8)

* Using Python within Stata for XGBoost
python:
import xgboost as xgb
from sfi import Data
import numpy as np

# Get data from Stata
price  = np.array(Data.get("price"))
mpg    = np.array(Data.get("mpg"))
weight = np.array(Data.get("weight"))
length = np.array(Data.get("length"))
tr     = np.array(Data.get("train"))

X = np.column_stack([mpg, weight, length])
y = price
mask_tr = (tr == 1)

model = xgb.XGBRegressor(
    n_estimators=500, max_depth=6,
    learning_rate=0.1, subsample=0.8
)
model.fit(X[mask_tr], y[mask_tr])
pred = model.predict(X)

ss_res = np.sum((y[~mask_tr] - pred[~mask_tr])**2)
ss_tot = np.sum((y[~mask_tr] - y[~mask_tr].mean())**2)
r2 = 1 - ss_res / ss_tot
print(f"XGBoost Test R2: {r2:.4f}")
end
# R: XGBoost
library(xgboost)

# Prepare data in XGBoost's matrix format
features <- as.matrix(train[, -which(names(train) == "medv")])
label    <- train$medv

dtrain <- xgb.DMatrix(data = features, label = label)
dtest  <- xgb.DMatrix(
  data = as.matrix(test[, -which(names(test) == "medv")]),
  label = test$medv
)

# Set parameters
params <- list(
  objective = "reg:squarederror",
  max_depth = 6,
  eta = 0.1,
  subsample = 0.8,
  colsample_bytree = 0.8
)

# Train with early stopping
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 500,
  watchlist = list(train = dtrain, test = dtest),
  early_stopping_rounds = 20,
  verbose = 0
)

# Evaluate
pred_xgb <- predict(xgb_model, dtest)
ss_res <- sum((test$medv - pred_xgb)^2)
ss_tot <- sum((test$medv - mean(test$medv))^2)
r2_xgb <- 1 - ss_res / ss_tot

cat("XGBoost Test R-squared:", round(r2_xgb, 4), "\n")
cat("Best iteration:", xgb_model$best_iteration, "\n")
Python Output
Test R-squared: 0.8389
Test RMSE:      0.4837
Best iteration: 247

--- Comparison ---
Random Forest R2: 0.8102
XGBoost R2:       0.8389
Stata Output
XGBoost Test R2: 0.8534
R Output
XGBoost Test R-squared: 0.8912
Best iteration: 189

Tuning Tree-Based Models

Tree-based models have several hyperparameters that control model complexity and must be chosen carefully. The most important ones are: n_estimators (the number of trees), max_depth (maximum depth of each tree), min_samples_split or min_samples_leaf (minimum observations needed to create a split or in a leaf), learning_rate (for boosting — how much each tree contributes), and subsample (fraction of data used per tree). Getting these right can make a substantial difference in performance.

The standard approach is grid search with cross-validation: define a grid of candidate values for each hyperparameter, train the model for every combination, and evaluate each using k-fold cross-validation. The combination that produces the best cross-validated score is selected. For large grids, random search is a more efficient alternative: it randomly samples combinations from the grid, and empirical studies show that random search finds good configurations much faster than exhaustive grid search because it explores the hyperparameter space more broadly.

A practical rule of thumb for XGBoost: start with a moderate learning rate (0.05 to 0.1), set max_depth between 3 and 8, use subsample and colsample_bytree around 0.7 to 0.9, and rely on early stopping to determine the optimal number of trees. Then fine-tune from there.

from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# Define the hyperparameter search space
param_grid = {
    'n_estimators': [100, 300, 500, 800],
    'max_depth': [3, 4, 5, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 3, 5, 10],
}

# Random search with 5-fold cross-validation
search = RandomizedSearchCV(
    xgb.XGBRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=50,
    cv=5,
    scoring='neg_mean_squared_error',
    random_state=42,
    n_jobs=-1,
    verbose=0
)
search.fit(X_train, y_train)

# Best parameters and score
print("Best parameters:")
for param, value in search.best_params_.items():
    print(f"  {param}: {value}")

best_rmse = (-search.best_score_) ** 0.5
print(f"\nBest CV RMSE: {best_rmse:.4f}")

# Evaluate best model on test set
y_pred_best = search.best_estimator_.predict(X_test)
print(f"Test R-squared: {r2_score(y_test, y_pred_best):.4f}")
* Stata: Hyperparameter tuning via Python block
* Native Stata does not support grid/random search for ML models.
* You can loop over parameter values manually or use Python.

python:
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb
import numpy as np
from sfi import Data

# Get data from Stata
price  = np.array(Data.get("price"))
mpg    = np.array(Data.get("mpg"))
weight = np.array(Data.get("weight"))
length = np.array(Data.get("length"))
tr     = np.array(Data.get("train"))

X = np.column_stack([mpg, weight, length])
y = price
mask = (tr == 1)

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 500],
    'subsample': [0.7, 0.8, 0.9],
}

search = RandomizedSearchCV(
    xgb.XGBRegressor(random_state=42),
    param_grid, n_iter=20, cv=5,
    scoring='neg_mean_squared_error', random_state=42
)
search.fit(X[mask], y[mask])
print(f"Best params: {search.best_params_}")
print(f"Best CV RMSE: {(-search.best_score_)**0.5:.2f}")
end
# R: Hyperparameter tuning for XGBoost with caret
library(caret)

# Define grid of hyperparameters
xgb_grid <- expand.grid(
  nrounds = c(100, 300, 500),
  max_depth = c(3, 5, 7),
  eta = c(0.01, 0.05, 0.1),
  gamma = 0,
  colsample_bytree = c(0.7, 0.8),
  min_child_weight = c(1, 5),
  subsample = 0.8
)

# Train with 5-fold cross-validation
ctrl <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = FALSE
)

# Random search (select subset of grid)
set.seed(42)
xgb_tuned <- train(
  medv ~ ., data = train,
  method = "xgbTree",
  trControl = ctrl,
  tuneGrid = xgb_grid[sample(nrow(xgb_grid), 30), ],
  verbose = FALSE
)

# Best parameters
cat("Best parameters:\n")
print(xgb_tuned$bestTune)

# Evaluate on test set
pred_tuned <- predict(xgb_tuned, newdata = test)
r2_tuned <- 1 - sum((test$medv - pred_tuned)^2) /
                 sum((test$medv - mean(test$medv))^2)
cat("Test R-squared:", round(r2_tuned, 4), "\n")
Python Output
Best parameters:
  subsample: 0.8
  n_estimators: 500
  min_child_weight: 3
  max_depth: 5
  learning_rate: 0.05
  colsample_bytree: 0.8

Best CV RMSE: 0.4712
Test R-squared: 0.8456
Stata Output
Best params: {'subsample': 0.8, 'n_estimators': 300, 'max_depth': 5, 'learning_rate': 0.05}
Best CV RMSE: 1756.32
R Output
Best parameters:
  nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
      300         5 0.05     0              0.8                5       0.8

Test R-squared: 0.9023

Interpreting Predictions with SHAP

Variable importance tells us which features are globally important, but it does not tell us how a feature affects a specific prediction, or in which direction. SHAP values (SHapley Additive exPlanations), developed by Lundberg and Lee in 2017, provide exactly this. SHAP is based on Shapley values from cooperative game theory, and the core idea is beautifully intuitive.

Imagine you are trying to explain why a particular house was predicted to cost $450,000. The average prediction across all houses is $350,000, so we need to explain a difference of $100,000. SHAP assigns a portion of that $100,000 to each feature: maybe "median income in the neighborhood" contributes +$70,000, "proximity to the coast" contributes +$40,000, and "house age" contributes -$10,000. These contributions sum up to exactly the difference between the prediction and the average (this is a mathematical guarantee of Shapley values). Each SHAP value tells you how much a feature pushed the prediction above or below the baseline for that particular observation.
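The additivity guarantee can be verified by hand for a model with two features, where the exact Shapley value is just the average marginal contribution over the two possible feature orderings. A toy sketch follows; the linear model, the data, and the value function v(S) that fixes absent features at their means are all illustrative assumptions, not part of the SHAP library:

```python
import numpy as np

# Toy model: a linear function keeps the arithmetic transparent
def f(x1, x2):
    return 3 * x1 + 2 * x2

rng = np.random.default_rng(0)
x1_all = rng.normal(5, 1, 1000)
x2_all = rng.normal(2, 1, 1000)
base = f(x1_all.mean(), x2_all.mean())  # baseline: prediction at the means

# Observation to explain
x1, x2 = 7.0, 1.0

# v(S): model value with features outside the coalition S fixed at their mean
v_empty = base
v_1  = f(x1, x2_all.mean())
v_2  = f(x1_all.mean(), x2)
v_12 = f(x1, x2)

# Shapley value = average marginal contribution over both orderings
phi1 = 0.5 * (v_1 - v_empty) + 0.5 * (v_12 - v_2)
phi2 = 0.5 * (v_2 - v_empty) + 0.5 * (v_12 - v_1)

print(f"phi1 = {phi1:.3f}, phi2 = {phi2:.3f}")
print(f"phi1 + phi2 = {phi1 + phi2:.3f}")
print(f"f(x) - base = {v_12 - base:.3f}")  # additivity: these two must match
```

With more features, the number of orderings grows factorially; TreeExplainer avoids this blow-up by exploiting the tree structure.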

The most common SHAP visualization is the summary plot (also called a beeswarm plot): for each feature, it shows a dot for every observation, colored by the feature value (red = high, blue = low) and positioned along the x-axis according to the SHAP value (positive = pushes prediction up, negative = pushes prediction down). This gives you both the importance and the direction of each feature's effect in a single plot.

SHAP values have become the gold standard for interpreting tree-based models because they are grounded in theory (they are the unique values satisfying a set of fairness axioms), work at the individual prediction level (not just globally), and have fast, exact implementations for tree models (via the TreeExplainer algorithm). In economics research, SHAP values are particularly useful for understanding heterogeneity — how different features matter for different observations.

import shap
import pandas as pd
import matplotlib.pyplot as plt

# Use the fitted XGBoost model from earlier
# TreeExplainer is optimized for tree-based models
explainer = shap.TreeExplainer(xgb_model)

# Compute SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Summary plot (beeswarm): shows feature importance + direction
shap.summary_plot(shap_values, X_test, show=False)
plt.title("SHAP Summary Plot")
plt.tight_layout()
plt.show()

# Force plot for a single prediction (observation 0)
print(f"\nPrediction for observation 0: {xgb_model.predict(X_test.iloc[[0]])[0]:.3f}")
print(f"Base value (mean prediction): {explainer.expected_value:.3f}")

# Show top SHAP contributors for this observation
obs_shap = pd.Series(shap_values[0], index=X_test.columns)
obs_shap = obs_shap.reindex(obs_shap.abs().sort_values(ascending=False).index)
print("\nSHAP values for observation 0:")
print(obs_shap.round(4))
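The block above explains a single observation; the Stata and R blocks below also report global importance as the mean absolute SHAP value per feature. That summary is just two NumPy operations on the SHAP matrix — a minimal sketch on a hypothetical 3×3 matrix (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical SHAP matrix: one row per observation, one column per feature
feature_names = ["mpg", "weight", "length"]
shap_matrix = np.array([
    [-200.0,   900.0,  100.0],
    [ 300.0, -1200.0,  -50.0],
    [-100.0,  1500.0,  200.0],
])

# Global importance = mean of |SHAP| down each column, sorted descending
mean_abs = np.abs(shap_matrix).mean(axis=0)
for name, val in sorted(zip(feature_names, mean_abs), key=lambda t: -t[1]):
    print(f"  {name:10s}: {val:.2f}")
# weight has the largest mean |SHAP|, so it matters most on average
```

Averaging absolute values matters: positive and negative SHAP values would otherwise cancel, hiding features that push different observations in opposite directions.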
* Stata: SHAP values via Python block
* Native Stata does not have SHAP support.
* Use the Python interface within Stata.

python:
import shap
import xgboost as xgb
import numpy as np
from sfi import Data

# Get data (reusing variables from earlier)
price  = np.array(Data.get("price"))
mpg    = np.array(Data.get("mpg"))
weight = np.array(Data.get("weight"))
length = np.array(Data.get("length"))
tr     = np.array(Data.get("train"))

X = np.column_stack([mpg, weight, length])
y = price
mask = (tr == 1)

# Fit XGBoost
model = xgb.XGBRegressor(n_estimators=300, max_depth=5, learning_rate=0.1)
model.fit(X[mask], y[mask])

# Compute SHAP values
explainer = shap.TreeExplainer(model)
shap_vals = explainer.shap_values(X[~mask])

# Display mean absolute SHAP values
feature_names = ["mpg", "weight", "length"]
mean_shap = np.abs(shap_vals).mean(axis=0)
for name, val in sorted(zip(feature_names, mean_shap), key=lambda x: -x[1]):
    print(f"  {name:10s}: {val:.2f}")
end
# R: SHAP values with the fastshap package
library(fastshap)
library(ggplot2)   # needed for autoplot() below

# Define a prediction wrapper for fastshap
pfun <- function(object, newdata) {
  predict(object, xgb.DMatrix(data = as.matrix(newdata)))
}

# Feature matrix (no outcome column)
X_train_mat <- train[, -which(names(train) == "medv")]
X_test_mat  <- test[, -which(names(test) == "medv")]

# Compute SHAP values
shap_vals <- explain(
  xgb_model,
  X = as.data.frame(X_train_mat),
  newdata = as.data.frame(X_test_mat),
  pred_wrapper = pfun,
  nsim = 50
)

# Mean absolute SHAP values (global importance)
mean_abs_shap <- colMeans(abs(as.matrix(shap_vals)))
mean_abs_shap <- sort(mean_abs_shap, decreasing = TRUE)
cat("Mean |SHAP| values:\n")
print(round(mean_abs_shap, 4))

# Autoplot (requires ggplot2)
autoplot(shap_vals) + ggtitle("SHAP Summary")
Python Output
Prediction for observation 0: 0.478
Base value (mean prediction): 2.069

SHAP values for observation 0:
MedInc       -1.2341
AveOccup     -0.3412
Latitude      0.1523
AveBedrms    -0.0987
Longitude     0.0934
AveRooms     -0.0876
HouseAge     -0.0543
Population   -0.0208

Generated figures:

[SHAP summary (beeswarm) plot: features ordered MedInc, AveOccup, Latitude, Longitude, HouseAge, AveRooms, Population; x-axis = SHAP value (impact on model output); color = feature value, low to high]
Stata Output
Mean |SHAP| values:
  weight    : 1523.45
  mpg       :  892.31
  length    :  456.78
R Output
Mean |SHAP| values:
  lstat     rm    dis   crim    nox ptratio    age  indus    tax     zn   chas    rad      b
 3.1245 2.4521 0.5432 0.4123 0.3876 0.3421 0.2134 0.1987 0.1654 0.0876 0.0654 0.0543 0.0312

[SHAP autoplot displayed with horizontal bars for each feature]

Trees vs. Regularization

Both tree-based methods and regularized regression (LASSO, Ridge, Elastic Net) are powerful prediction tools, but they have different strengths. The choice between them depends on the nature of your data and the problem you are trying to solve.

Tree-Based Methods (RF, XGBoost)

  • Automatically capture nonlinearities and interactions
  • No need for feature engineering (polynomials, interactions)
  • Robust to outliers and skewed distributions
  • Handle mixed data types (numeric + categorical) naturally
  • Do not require feature scaling or normalization
  • Best for: complex, nonlinear relationships; many features with interactions; tabular data competitions

Regularized Regression (LASSO, Ridge)

  • Produce interpretable, sparse models (especially LASSO)
  • Work well when the true relationship is approximately linear
  • Computationally faster to train and tune
  • LASSO performs automatic variable selection
  • Better with small sample sizes (fewer parameters to estimate)
  • Best for: inference-adjacent tasks; feature selection; linear or mildly nonlinear settings; small n

Limitations

Key Limitations of Tree-Based Methods

  • Single trees are unstable: Small changes in the training data can produce a completely different tree structure. This is why ensembles (Random Forests, boosting) are almost always preferred over single trees in practice.
  • Trees cannot extrapolate: Because predictions are averages of training observations in each leaf, trees can never predict values outside the range seen during training. If your training data has house prices between $100K and $1M, the model will never predict $1.5M, even for a mansion. This is a fundamental limitation compared to linear models.
  • Computational cost: Large Random Forests and XGBoost models with many trees can be slow to train on very large datasets, though parallelization helps. XGBoost is significantly faster than a naive implementation but still requires more computation than a LASSO regression.
  • Interpretability trade-off: While single trees are highly interpretable, ensembles of hundreds of trees are black boxes. SHAP values and variable importance plots help, but they do not provide the simple, transparent model that a policymaker or regulator might want.
  • Not designed for causal inference: Tree-based methods are prediction machines. High variable importance does not imply causation. For causal questions, use the methods from Module 6 (or see Module 11D: Causal ML for hybrid approaches).
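The extrapolation limitation is easy to demonstrate directly. The sketch below uses hypothetical data and a hand-rolled depth-1 regression tree (a "stump," written out rather than taken from a library so the mechanics are visible): it fits y = 2x on x ∈ [0, 10], then asks for a prediction at x = 100. The true value is 200, but the tree can only return a leaf mean, which is capped by the largest training outcome.

```python
import numpy as np

# Hypothetical training data: y = 2x on x in [0, 10]
x_train = np.linspace(0, 10, 50)
y_train = 2 * x_train

def fit_stump(x, y):
    """Depth-1 regression tree: try every midpoint between consecutive
    x values, keep the split minimizing within-leaf squared error,
    and predict the leaf mean on each side."""
    best = None
    for threshold in (x[:-1] + x[1:]) / 2:
        left, right = y[x <= threshold], y[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_mean, right_mean = best
    return lambda q: np.where(q <= threshold, left_mean, right_mean)

stump = fit_stump(x_train, y_train)

# Inside the training range the stump is a crude but bounded approximation...
print(stump(np.array([2.0, 8.0])))
# ...but far outside the range it just returns the upper-leaf mean,
# which can never exceed max(y_train) = 20 -- nowhere near the true 200
print(stump(np.array([100.0])))
```

A deeper tree or an ensemble refines the step function inside [0, 10], but every prediction remains an average of training outcomes, so the ceiling does not move; a linear fit of the same data would predict 200 at x = 100 exactly.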

References

  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapters 9, 10, 15.
  • Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems (NeurIPS).