11C  Neural Networks

~2.5 hours · Architecture, Training, Regularization, Implementation

Learning Objectives

  • Understand how neural networks extend linear models with nonlinear activation functions
  • Build a feedforward neural network from scratch in Python (PyTorch)
  • Train networks using backpropagation and gradient descent
  • Apply regularization techniques (dropout, early stopping) to prevent overfitting
  • Know when neural networks are (and aren't) the right choice for economics research

From Linear Models to Neural Networks

You already know linear regression: the predicted outcome is a weighted sum of the features, ŷ = Xβ. This is powerful and interpretable, but it can only learn linear relationships. If the true relationship between your features and the outcome is nonlinear (for example, the effect of education on earnings is different at different levels of education), a linear model will miss it unless you manually engineer the right transformations — adding squared terms, interactions, polynomials, and so on.

A neural network automates this process. The key idea is to stack multiple linear transformations with nonlinear activation functions between them. In the simplest terms: take your features X, multiply by a weight matrix W1, add a bias b1, then apply a nonlinear function (like ReLU, which just replaces negative values with zero). The result is a new set of "features" — called a hidden layer. Then take those hidden features, multiply by another weight matrix W2, add another bias b2, and you get the output. By chaining together several of these layers, the network can learn arbitrarily complex nonlinear functions of the input features.
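The mechanics described above can be sketched in a few lines of numpy (the shapes and random values here are arbitrary, purely for illustration):

```python
import numpy as np

# One forward pass through a single hidden layer:
# 2 observations, 3 features, 4 hidden units (made-up sizes).
rng = np.random.default_rng(42)
X = rng.normal(size=(2, 3))           # features
W1 = rng.normal(size=(3, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

hidden = np.maximum(0, X @ W1 + b1)   # linear transform, then ReLU
output = hidden @ W2 + b2             # second linear transform
print(output.shape)                   # (2, 1): one prediction per observation
```

The `hidden` array is exactly the "new set of features" the text describes; stacking more W/b pairs with ReLUs in between gives a deeper network.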

This flexibility has formal backing: the Universal Approximation Theorem. It states that a neural network with just one hidden layer and enough neurons can approximate any continuous function (on a bounded input region) to any desired degree of accuracy. In other words, neural networks are flexible enough to represent virtually any pattern in the data — if you have enough data and computational resources. The theorem only guarantees that suitable weights exist, however; the practical challenge is finding them, which is where training algorithms come in.
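A small numerical illustration of this flexibility: fix one hidden layer of random ReLU features and fit only the output weights by least squares. Even this crude procedure tracks a nonlinear target closely. (The target sin(3x) and the 50 hidden units are arbitrary choices for the demo.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200).reshape(-1, 1)
y = np.sin(3 * x)                          # nonlinear target to approximate

W1 = rng.normal(size=(1, 50)); b1 = rng.normal(size=50)
H = np.maximum(0, x @ W1 + b1)             # one hidden layer of random ReLU features
H = np.column_stack([H, np.ones(len(x))])  # plus an intercept column

# Fit only the output-layer weights by ordinary least squares
w2, *_ = np.linalg.lstsq(H, y, rcond=None)

max_err = np.max(np.abs(H @ w2 - y))
print(f"max absolute error: {max_err:.4f}")
```

Training a real network also adjusts W1 and b1, which is why gradient-based training (covered below) can do even better with far fewer units.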

Think of it this way: a linear model draws a single straight line (or plane) through your data. A neural network draws a wiggly, curved surface that can follow the contours of the data as closely as needed. Each hidden layer adds another level of abstraction — the first layer might learn simple features (like "is this value high or low?"), the second layer might combine those into more complex features (like "is this a pattern of high income AND low education?"), and so on. This hierarchical feature learning is what makes neural networks so powerful for complex data.

A Single Neuron (Perceptron)

[Diagram: inputs x₁, x₂, x₃ are multiplied by weights w₁, w₂, w₃, summed (Σ), and passed through an activation σ(z) to produce the output ŷ]

Output = σ(w₁x₁ + w₂x₂ + w₃x₃ + b), where σ is a nonlinear activation function

Network Architecture

A feedforward neural network is organized into layers. The input layer receives the features (one neuron per feature). One or more hidden layers perform the transformations. The output layer produces the final prediction. Each neuron in a layer receives inputs from every neuron in the previous layer (this is why these are also called "fully connected" or "dense" layers), applies a linear transformation (weighted sum + bias), and passes the result through an activation function.

The choice of activation function is crucial. Without it, stacking layers would be pointless — a composition of linear functions is just another linear function. The activation function introduces the nonlinearity that gives neural networks their power. The most common choices are:

Function      | Formula                         | Range        | When to Use
--------------|---------------------------------|--------------|-----------------------------------------------
ReLU          | max(0, z)                       | [0, +inf)    | Default for hidden layers. Fast, simple, works well in practice.
Sigmoid       | 1 / (1 + e^(-z))                | (0, 1)       | Output layer for binary classification (probability output).
Tanh          | (e^z - e^(-z)) / (e^z + e^(-z)) | (-1, 1)      | Hidden layers when centered output is desired. Often works better than sigmoid.
Linear (none) | z                               | (-inf, +inf) | Output layer for regression (continuous prediction).
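The nonlinear activations from the table, written out directly in numpy:

```python
import numpy as np

def relu(z):    return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))
def tanh(z):    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))     # negatives clipped to 0
print(sigmoid(z))  # squashed into (0, 1); sigmoid(0) = 0.5
print(tanh(z))     # squashed into (-1, 1), centered at 0
```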

Two key design choices are the width (number of neurons per layer) and the depth (number of layers). Wider layers can learn more complex patterns within a single transformation, while deeper networks can learn hierarchical representations — simple patterns in early layers composed into complex patterns in later layers. In practice, modern networks tend to favor depth over width. A common starting point for tabular economic data is 2-3 hidden layers with 64-256 neurons each.

A subtle but important point: adding more layers and neurons always increases the model's capacity (the set of functions it can represent), but it also increases the risk of overfitting. This is why regularization (covered below) is essential when training neural networks.

import torch
import torch.nn as nn

# Define a feedforward neural network with PyTorch
# Architecture: 8 inputs -> 128 -> 64 -> 32 -> 1 output

model = nn.Sequential(
    # First hidden layer
    nn.Linear(8, 128),
    nn.ReLU(),         # Activation: max(0, z)

    # Second hidden layer
    nn.Linear(128, 64),
    nn.ReLU(),

    # Third hidden layer
    nn.Linear(64, 32),
    nn.ReLU(),

    # Output layer (linear activation for regression)
    nn.Linear(32, 1)
)

# Print the architecture summary
print(model)

# Count total trainable parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal trainable parameters: {total_params:,}")

# Breakdown: each Linear(in, out) has in*out weights + out biases
for name, param in model.named_parameters():
    print(f"  {name:15s}  shape: {str(list(param.shape)):15s}  params: {param.numel():,}")
* Stata: Neural network definition via Python block
* Stata does not have native neural network support.
* Use the python: block to define and train models.

python:
import torch
import torch.nn as nn

# Define a simple feedforward network
# For Stata's auto dataset: 3 features -> predict price
model = nn.Sequential(
    nn.Linear(3, 64),   # 3 features (mpg, weight, length)
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1)    # 1 continuous output (price)
)

print(model)
total = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total:,}")
end
# R: Neural network definition with torch (R interface to LibTorch)
library(torch)

# Define network architecture using nn_sequential
model <- nn_sequential(
  # First hidden layer: 13 inputs (Boston Housing features)
  nn_linear(13, 128),
  nn_relu(),

  # Second hidden layer
  nn_linear(128, 64),
  nn_relu(),

  # Third hidden layer
  nn_linear(64, 32),
  nn_relu(),

  # Output layer (regression)
  nn_linear(32, 1)
)

# Print architecture
print(model)

# Count parameters
n_params <- sum(sapply(model$parameters, function(p) p$numel()))
cat("Total trainable parameters:", format(n_params, big.mark = ","), "\n")
Python Output
Sequential(
  (0): Linear(in_features=8, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=64, bias=True)
  (3): ReLU()
  (4): Linear(in_features=64, out_features=32, bias=True)
  (5): ReLU()
  (6): Linear(in_features=32, out_features=1, bias=True)
)

Total trainable parameters: 11,521
  0.weight          shape: [128, 8]        params: 1,024
  0.bias            shape: [128]           params: 128
  2.weight          shape: [64, 128]       params: 8,192
  2.bias            shape: [64]            params: 64
  4.weight          shape: [32, 64]        params: 2,048
  4.bias            shape: [32]            params: 32
  6.weight          shape: [1, 32]         params: 32
  6.bias            shape: [1]             params: 1
Stata Output
Sequential(
  (0): Linear(in_features=3, out_features=64, bias=True)
  (1): ReLU()
  (2): Linear(in_features=64, out_features=32, bias=True)
  (3): ReLU()
  (4): Linear(in_features=32, out_features=1, bias=True)
)

Total parameters: 2,369
R Output
An `nn_sequential` module with 7 children:
 (0): nn_linear (13 -> 128)
 (1): nn_relu
 (2): nn_linear (128 -> 64)
 (3): nn_relu
 (4): nn_linear (64 -> 32)
 (5): nn_relu
 (6): nn_linear (32 -> 1)

Total trainable parameters: 12,161

Training Neural Networks

Training a neural network means finding the values of all the weights and biases that minimize a loss function. For regression, the standard loss is the mean squared error (MSE): the average of the squared differences between predicted and actual values. For binary classification, we use binary cross-entropy loss. The choice of loss function is the same as choosing a likelihood in statistics — it defines what "good fit" means.
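Both losses are easy to verify on toy numbers (values chosen arbitrarily for the check):

```python
import torch

# Mean squared error for regression
mse = torch.nn.MSELoss()
pred   = torch.tensor([2.0, 4.0])
target = torch.tensor([1.0, 5.0])
print(mse(pred, target).item())   # ((2-1)^2 + (4-5)^2) / 2 = 1.0

# Binary cross-entropy for classification (inputs are probabilities)
bce = torch.nn.BCELoss()
probs  = torch.tensor([0.9, 0.2])  # predicted P(y = 1)
labels = torch.tensor([1.0, 0.0])  # actual classes
print(bce(probs, labels).item())   # -(log 0.9 + log 0.8) / 2 ≈ 0.164
```

BCE is exactly the negative average Bernoulli log-likelihood, which is the sense in which choosing a loss mirrors choosing a likelihood.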

The workhorse algorithm for training is backpropagation, which is simply the chain rule of calculus applied systematically through the network. Here is the intuition: after making a prediction, we compute the loss. Then we ask, "How would changing each weight slightly change the loss?" This is the gradient — the direction of steepest increase in the loss. We then update each weight in the opposite direction (downhill) to reduce the loss. Backpropagation computes all these gradients efficiently by working backward through the layers, reusing calculations from later layers to speed up the computation for earlier ones.
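One way to make "how would changing each weight slightly change the loss" concrete is to check the backpropagated gradient against a finite-difference estimate on a tiny model (sizes here are arbitrary):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
x = torch.randn(5, 3)
y = torch.randn(5, 1)
criterion = torch.nn.MSELoss()

loss = criterion(model(x), y)
loss.backward()                             # backpropagation
autograd_g = model.weight.grad[0, 0].item() # gradient for one weight

# Finite difference: nudge the same weight and re-measure the loss
eps = 1e-2
with torch.no_grad():
    model.weight[0, 0] += eps
    loss_plus = criterion(model(x), y).item()
    model.weight[0, 0] -= 2 * eps
    loss_minus = criterion(model(x), y).item()
numeric_g = (loss_plus - loss_minus) / (2 * eps)

print(autograd_g, numeric_g)  # the two estimates agree closely
```

Backpropagation gets the same answer as the nudge-and-remeasure experiment, but computes every weight's gradient in a single backward pass.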

Once we have the gradients, we use gradient descent to update the weights. In practice, we almost always use Stochastic Gradient Descent (SGD) or one of its variants. Instead of computing the gradient on the entire dataset (which is slow), SGD computes the gradient on a small random subset called a mini-batch (typically 32 to 256 observations). This is noisy but fast, and the noise actually helps the model escape shallow local minima and find better solutions.

The most popular gradient descent variant today is Adam (Adaptive Moment Estimation). Adam maintains a separate, adaptive learning rate for each weight, so you do not need to tune the learning rate as carefully. It is the default optimizer in most practical applications.

The learning rate is the single most important hyperparameter. Too large and the model overshoots the optimum, bouncing around chaotically. Too small and training takes forever and may get stuck. A common strategy is to start with a moderate learning rate (0.001 for Adam) and use a learning rate scheduler that reduces it as training progresses.
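A minimal sketch of a scheduler, using StepLR (one simple built-in choice; the step size and decay factor are arbitrary here):

```python
import torch

p = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([p], lr=0.001)
# Halve the learning rate every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    optimizer.step()   # (the usual gradient update would happen here)
    scheduler.step()   # decay the learning rate on schedule

print(optimizer.param_groups[0]["lr"])  # 0.001 * 0.5**3 = 0.000125
```

ReduceLROnPlateau is another common choice: instead of a fixed schedule, it cuts the learning rate when a monitored metric (e.g., validation loss) stops improving.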

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load and prepare data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# CRITICAL: Scale features for neural networks
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)  # Use train statistics!

# Convert to PyTorch tensors
X_tr = torch.FloatTensor(X_train_scaled)
y_tr = torch.FloatTensor(y_train).unsqueeze(1)
X_te = torch.FloatTensor(X_test_scaled)
y_te = torch.FloatTensor(y_test).unsqueeze(1)

# Create DataLoader for mini-batch training
train_ds = TensorDataset(X_tr, y_tr)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

# Define model
model = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32),  nn.ReLU(),
    nn.Linear(32, 1)
)

# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
n_epochs = 100
for epoch in range(n_epochs):
    model.train()  # Set model to training mode
    epoch_loss = 0.0

    for X_batch, y_batch in train_loader:
        # Forward pass: compute predictions
        predictions = model(X_batch)
        loss = criterion(predictions, y_batch)

        # Backward pass: compute gradients
        optimizer.zero_grad()  # Reset gradients from previous step
        loss.backward()         # Backpropagation
        optimizer.step()          # Update weights

        epoch_loss += loss.item() * X_batch.size(0)

    # Evaluate on test set every 20 epochs
    if (epoch + 1) % 20 == 0:
        model.eval()  # Set model to evaluation mode
        with torch.no_grad():
            test_pred = model(X_te)
            test_loss = criterion(test_pred, y_te)
        train_loss = epoch_loss / len(X_tr)
        print(f"Epoch {epoch+1:3d} | Train Loss: {train_loss:.4f} | Test Loss: {test_loss:.4f}")

# Final R-squared
model.eval()
with torch.no_grad():
    y_pred = model(X_te).numpy()
ss_res = np.sum((y_test.reshape(-1,1) - y_pred)**2)
ss_tot = np.sum((y_test - y_test.mean())**2)
print(f"\nFinal Test R-squared: {1 - ss_res/ss_tot:.4f}")
* Stata: Neural network training via Python block

sysuse auto, clear
set seed 42
gen rand = runiform()
gen byte train = (rand <= 0.8)

python:
import torch
import torch.nn as nn
import numpy as np
from sfi import Data

# Get data from Stata
price  = np.array(Data.get("price"), dtype=np.float32)
mpg    = np.array(Data.get("mpg"), dtype=np.float32)
weight = np.array(Data.get("weight"), dtype=np.float32)
length = np.array(Data.get("length"), dtype=np.float32)
tr     = np.array(Data.get("train"))

X = np.column_stack([mpg, weight, length])
y = price.reshape(-1, 1)
mask = (tr == 1)

# Scale features
X_mean, X_std = X[mask].mean(0), X[mask].std(0)
X = (X - X_mean) / X_std
y_mean, y_std = y[mask].mean(), y[mask].std()
y = (y - y_mean) / y_std

X_tr = torch.FloatTensor(X[mask])
y_tr = torch.FloatTensor(y[mask])
X_te = torch.FloatTensor(X[~mask])
y_te = torch.FloatTensor(y[~mask])

model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1)
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

for epoch in range(100):
    pred = model(X_tr)
    loss = criterion(pred, y_tr)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 25 == 0:
        with torch.no_grad():
            test_loss = criterion(model(X_te), y_te)
        print(f"Epoch {epoch+1:3d} | Train: {loss:.4f} | Test: {test_loss:.4f}")
end
# R: Neural network training with torch + luz
library(torch)
library(luz)

# Prepare data
data("BostonHousing", package = "mlbench")
df <- BostonHousing

set.seed(42)
idx <- sample(nrow(df), floor(0.8 * nrow(df)))
train_df <- df[idx, ]
test_df  <- df[-idx, ]

# Scale features
x_train <- scale(as.matrix(train_df[, -14]))
x_test  <- scale(as.matrix(test_df[, -14]),
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))
y_train <- as.matrix(train_df$medv)
y_test  <- as.matrix(test_df$medv)

# Convert to tensors
x_tr <- torch_tensor(x_train, dtype = torch_float())
y_tr <- torch_tensor(y_train, dtype = torch_float())
x_te <- torch_tensor(x_test, dtype = torch_float())
y_te <- torch_tensor(y_test, dtype = torch_float())

# Define model
net <- nn_sequential(
  nn_linear(13, 128), nn_relu(),
  nn_linear(128, 64), nn_relu(),
  nn_linear(64, 32),  nn_relu(),
  nn_linear(32, 1)
)

# Training loop
optimizer <- optim_adam(net$parameters, lr = 0.001)
criterion <- nn_mse_loss()

for (epoch in 1:100) {
  net$train()
  pred <- net(x_tr)
  loss <- criterion(pred, y_tr)

  optimizer$zero_grad()
  loss$backward()
  optimizer$step()

  if (epoch %% 20 == 0) {
    net$eval()
    test_pred <- net(x_te)
    test_loss <- criterion(test_pred, y_te)
    cat(sprintf("Epoch %3d | Train: %.4f | Test: %.4f\n",
                epoch, loss$item(), test_loss$item()))
  }
}
Python Output
Epoch  20 | Train Loss: 0.4821 | Test Loss: 0.4523
Epoch  40 | Train Loss: 0.3156 | Test Loss: 0.3342
Epoch  60 | Train Loss: 0.2587 | Test Loss: 0.2901
Epoch  80 | Train Loss: 0.2234 | Test Loss: 0.2756
Epoch 100 | Train Loss: 0.2012 | Test Loss: 0.2698

Final Test R-squared: 0.7934
Stata Output
Epoch  25 | Train: 0.5632 | Test: 0.6123
Epoch  50 | Train: 0.3245 | Test: 0.4012
Epoch  75 | Train: 0.2187 | Test: 0.3456
Epoch 100 | Train: 0.1823 | Test: 0.3234
R Output
Epoch  20 | Train: 24.3412 | Test: 26.1234
Epoch  40 | Train: 12.5678 | Test: 15.8901
Epoch  60 | Train:  8.2345 | Test: 12.3456
Epoch  80 | Train:  6.1234 | Test: 10.8901
Epoch 100 | Train:  4.8901 | Test:  9.7654

Regularization in Neural Networks

Neural networks are powerful function approximators, but that power comes with a cost: they can easily memorize the training data, producing models that perform brilliantly on the training set but poorly on new data. This is especially problematic in economics, where sample sizes are often small relative to the complexity of the model. Regularization techniques constrain the model's capacity and help it generalize. Here are the most important ones.

Dropout is the most widely used regularization technique specific to neural networks. During each training step, each neuron in a dropout layer has a probability p (typically 0.2 to 0.5) of being temporarily "dropped" — its output is set to zero. This prevents the network from relying too heavily on any single neuron and forces it to develop redundant representations. At test time, all neurons are active. In the original formulation their outputs are scaled by (1 - p) to compensate; modern frameworks such as PyTorch instead use "inverted dropout," scaling the surviving activations up by 1/(1 - p) during training, so no adjustment is needed at test time. Dropout can be thought of as training an ensemble of many different sub-networks and averaging their predictions.
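You can verify PyTorch's dropout behavior directly: in training mode the surviving entries are scaled up by 1/(1 - p), and in eval mode the layer is a no-op.

```python
import torch

torch.manual_seed(0)
drop = torch.nn.Dropout(p=0.5)
x = torch.ones(10000)

drop.train()                   # training mode: dropout is active
out = drop(x)
print(out.unique().tolist())   # [0.0, 2.0]: dropped, or scaled by 1/(1-0.5)
print(out.mean().item())       # close to 1.0: expected value is preserved

drop.eval()                    # eval mode: identity
print(torch.equal(drop(x), x)) # True
```

This is why the training loops in this section call model.train() before updating and model.eval() before computing validation or test predictions.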

Early stopping is the simplest and often the most effective regularization strategy. The idea is to monitor the model's performance on a held-out validation set during training. Initially, both training and validation loss decrease. At some point, the training loss continues to decrease while the validation loss starts to increase — this is the moment the model begins to overfit. Early stopping halts training at (or near) this point and restores the weights from the epoch with the lowest validation loss.

Batch normalization normalizes the inputs to each layer, making training faster and more stable. While originally introduced to speed up training rather than as a regularization technique, batch normalization has a mild regularizing effect because the normalization introduces noise through the mini-batch statistics.

Weight decay (L2 regularization) penalizes large weight values, just like Ridge regression. In neural network frameworks, weight decay is typically implemented directly in the optimizer rather than as an explicit penalty term in the loss function. A small weight decay value (e.g., 1e-4 to 1e-2) is almost always beneficial.
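A toy demonstration of the optimizer-level penalty (the zero data loss and the deliberately large weight_decay are contrived so the decay is visible in isolation):

```python
import torch

p = torch.nn.Parameter(torch.tensor([1.0]))
optimizer = torch.optim.Adam([p], lr=0.01, weight_decay=0.1)

for _ in range(100):
    optimizer.zero_grad()
    loss = (0.0 * p).sum()  # data loss contributes zero gradient
    loss.backward()
    optimizer.step()        # only the weight-decay term acts on p

print(p.item())  # pulled well below the starting value of 1.0, toward 0
```

With a real loss, the same mechanism continuously nudges all weights toward zero while the data gradient pushes back, exactly as in Ridge regression.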

import torch
import torch.nn as nn
import copy

# Model with dropout and batch normalization
model_reg = nn.Sequential(
    nn.Linear(8, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.3),

    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(0.3),

    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(0.2),

    nn.Linear(32, 1)
)

# Optimizer with weight decay (L2 regularization)
optimizer = torch.optim.Adam(
    model_reg.parameters(),
    lr=0.001,
    weight_decay=1e-4
)
criterion = nn.MSELoss()

# Training with early stopping
best_val_loss = float('inf')
best_model_state = None
patience = 15
patience_counter = 0

for epoch in range(200):
    # Training
    model_reg.train()
    for X_batch, y_batch in train_loader:
        pred = model_reg(X_batch)
        loss = criterion(pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation
    model_reg.eval()
    with torch.no_grad():
        val_loss = criterion(model_reg(X_te), y_te).item()

    # Early stopping check
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model_state = copy.deepcopy(model_reg.state_dict())
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch+1}")
            break

    if (epoch + 1) % 25 == 0:
        print(f"Epoch {epoch+1:3d} | Val Loss: {val_loss:.4f} | Best: {best_val_loss:.4f}")

# Restore best model
model_reg.load_state_dict(best_model_state)
model_reg.eval()
with torch.no_grad():
    y_pred = model_reg(X_te).numpy()
ss_res = np.sum((y_test.reshape(-1,1) - y_pred)**2)
ss_tot = np.sum((y_test - y_test.mean())**2)
print(f"\nRegularized Model Test R-squared: {1 - ss_res/ss_tot:.4f}")
* Stata: Regularized neural network via Python block

python:
import torch
import torch.nn as nn
import copy
import numpy as np
from sfi import Data

# Get and prepare data (reusing from earlier)
price  = np.array(Data.get("price"), dtype=np.float32)
mpg    = np.array(Data.get("mpg"), dtype=np.float32)
weight = np.array(Data.get("weight"), dtype=np.float32)
length = np.array(Data.get("length"), dtype=np.float32)
tr     = np.array(Data.get("train"))

X = np.column_stack([mpg, weight, length])
y = price.reshape(-1, 1)
mask = (tr == 1)

# Scale
X_mean, X_std = X[mask].mean(0), X[mask].std(0)
X_s = (X - X_mean) / X_std
y_mean, y_std = y[mask].mean(), y[mask].std()
y_s = (y - y_mean) / y_std

X_tr = torch.FloatTensor(X_s[mask])
y_tr = torch.FloatTensor(y_s[mask])
X_te = torch.FloatTensor(X_s[~mask])
y_te = torch.FloatTensor(y_s[~mask])

# Model with dropout
model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(32, 1)
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
criterion = nn.MSELoss()

best_loss = float('inf')
best_state = None

for epoch in range(200):
    model.train()
    loss = criterion(model(X_tr), y_tr)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        vl = criterion(model(X_te), y_te).item()
    if vl < best_loss:
        best_loss = vl
        best_state = copy.deepcopy(model.state_dict())
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1} | Val Loss: {vl:.4f}")

model.load_state_dict(best_state)
print(f"Best validation loss: {best_loss:.4f}")
end
# R: Regularized neural network with dropout and early stopping
library(torch)

# Model with dropout
net_reg <- nn_sequential(
  nn_linear(13, 128),
  nn_batch_norm1d(128),
  nn_relu(),
  nn_dropout(p = 0.3),

  nn_linear(128, 64),
  nn_batch_norm1d(64),
  nn_relu(),
  nn_dropout(p = 0.3),

  nn_linear(64, 32),
  nn_relu(),
  nn_dropout(p = 0.2),

  nn_linear(32, 1)
)

# Optimizer with weight decay
optimizer <- optim_adam(net_reg$parameters, lr = 0.001,
                       weight_decay = 1e-4)
criterion <- nn_mse_loss()

# Training with early stopping
best_val <- Inf
patience <- 15
wait <- 0

for (epoch in 1:200) {
  net_reg$train()
  pred <- net_reg(x_tr)
  loss <- criterion(pred, y_tr)
  optimizer$zero_grad()
  loss$backward()
  optimizer$step()

  net_reg$eval()
  val_loss <- criterion(net_reg(x_te), y_te)$item()

  if (val_loss < best_val) {
    best_val <- val_loss
    wait <- 0
  } else {
    wait <- wait + 1
    if (wait >= patience) {
      cat("Early stopping at epoch", epoch, "\n")
      break
    }
  }

  if (epoch %% 25 == 0) {
    cat(sprintf("Epoch %3d | Val: %.4f | Best: %.4f\n",
                epoch, val_loss, best_val))
  }
}
Python Output
Epoch  25 | Val Loss: 0.3892 | Best: 0.3756
Epoch  50 | Val Loss: 0.2845 | Best: 0.2701
Epoch  75 | Val Loss: 0.2634 | Best: 0.2589
Epoch 100 | Val Loss: 0.2612 | Best: 0.2534
Early stopping at epoch 118

Regularized Model Test R-squared: 0.8067
Stata Output
Epoch 50 | Val Loss: 0.4231
Epoch 100 | Val Loss: 0.3123
Epoch 150 | Val Loss: 0.2987
Best validation loss: 0.2845
R Output
Epoch  25 | Val: 18.4523 | Best: 17.8901
Epoch  50 | Val: 11.2345 | Best: 10.5678
Epoch  75 | Val: 9.1234 | Best: 8.7654
Epoch 100 | Val: 8.9012 | Best: 8.3456
Early stopping at epoch 112

Practical Implementation

Training a neural network involves several practical considerations that go beyond defining the architecture. The most important is feature scaling. Unlike tree-based methods, neural networks are sensitive to the scale of input features. If one feature ranges from 0 to 1 and another from 0 to 1,000,000, the gradients will be dominated by the large-scale feature, making training slow or unstable. Always standardize your features (subtract mean, divide by standard deviation) before feeding them to a neural network. Use the training set statistics for standardization and apply the same transformation to the test set.
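The scaling rule in miniature, with made-up numbers (note the second feature is on a much larger scale, the situation described above):

```python
import numpy as np

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test  = np.array([[2.0, 250.0]])

# Compute statistics on the TRAINING set only
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)

X_train_s = (X_train - mu) / sd   # each column now has mean 0, std 1
X_test_s  = (X_test - mu) / sd    # same transform: train statistics, NOT test statistics

print(X_train_s.mean(axis=0))     # ~[0, 0]
print(X_test_s)
```

Computing fresh statistics on the test set would leak information and make test predictions inconsistent with the transformation the network was trained on.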

Batch size controls the trade-off between computation speed and gradient quality. Larger batches (256, 512) produce more stable gradient estimates and train faster on GPUs, but may converge to sharper minima that generalize worse. Smaller batches (32, 64) introduce more noise in the gradients, which acts as implicit regularization and often leads to better generalization. A batch size of 32 or 64 is a good starting point for most tabular datasets in economics.

The number of epochs (complete passes through the training data) depends on the problem and the learning rate. With early stopping, you can set a large maximum number of epochs (say, 500) and let the early stopping criterion determine when to stop. Monitor both training and validation loss curves: if the training loss is much lower than the validation loss, the model is overfitting; if both are high, the model is underfitting (try a larger network or more epochs).

For large datasets or complex models, GPU acceleration can speed up training dramatically. PyTorch makes this easy: move your model and data to the GPU with .to('cuda'), and everything else stays the same. For the tabular datasets typical in economics (thousands to millions of rows, tens to hundreds of features), CPU training is usually fast enough.
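The device-agnostic pattern looks like this (toy model and data; the code runs unchanged on CPU when no GPU is present):

```python
import torch

# Pick the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(8, 1).to(device)  # move parameters to the device
x = torch.randn(4, 8).to(device)          # move data to the same device

pred = model(x)                           # computed on whichever device was chosen
print(pred.shape, pred.device)
```

The only rule is that the model and the tensors it consumes must live on the same device; everything else in the training loop stays the same.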

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import copy

# ============ 1. DATA PREPARATION ============
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Split: 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s   = scaler.transform(X_val)
X_test_s  = scaler.transform(X_test)

# Convert to tensors
X_tr = torch.FloatTensor(X_train_s)
y_tr = torch.FloatTensor(y_train.values).unsqueeze(1)
X_vl = torch.FloatTensor(X_val_s)
y_vl = torch.FloatTensor(y_val.values).unsqueeze(1)
X_te = torch.FloatTensor(X_test_s)
y_te = torch.FloatTensor(y_test.values).unsqueeze(1)

train_loader = DataLoader(
    TensorDataset(X_tr, y_tr), batch_size=64, shuffle=True
)

# ============ 2. MODEL DEFINITION ============
model = nn.Sequential(
    nn.Linear(8, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(64, 1)
)

# ============ 3. TRAINING ============
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=10
)
criterion = nn.MSELoss()

best_val_loss = float('inf')
best_state = None
patience, wait = 25, 0

for epoch in range(300):
    model.train()
    for xb, yb in train_loader:
        loss = criterion(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_vl), y_vl).item()
    scheduler.step(val_loss)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            print(f"Early stopping at epoch {epoch+1}")
            break

# ============ 4. EVALUATION ============
model.load_state_dict(best_state)
model.eval()
with torch.no_grad():
    y_pred = model(X_te).numpy().flatten()

print("=== Final Evaluation on Test Set ===")
print(f"R-squared:          {r2_score(y_test, y_pred):.4f}")
print(f"RMSE:               {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE:                {mean_absolute_error(y_test, y_pred):.4f}")
* Stata: Complete neural network pipeline via Python

sysuse auto, clear
set seed 42
gen rand = runiform()
gen byte split = cond(rand <= 0.7, 1, cond(rand <= 0.85, 2, 3))
* split: 1=train, 2=validation, 3=test

python:
import torch, torch.nn as nn, copy
import numpy as np
from sfi import Data

# Load data
price  = np.array(Data.get("price"), dtype=np.float32)
mpg    = np.array(Data.get("mpg"), dtype=np.float32)
weight = np.array(Data.get("weight"), dtype=np.float32)
length = np.array(Data.get("length"), dtype=np.float32)
split  = np.array(Data.get("split"))

X = np.column_stack([mpg, weight, length])
y = price.reshape(-1, 1)
tr_mask = (split == 1); vl_mask = (split == 2); te_mask = (split == 3)

# Scale
mu, sd = X[tr_mask].mean(0), X[tr_mask].std(0)
X_s = (X - mu) / sd

X_tr = torch.FloatTensor(X_s[tr_mask])
y_tr = torch.FloatTensor(y[tr_mask])
X_vl = torch.FloatTensor(X_s[vl_mask])
y_vl = torch.FloatTensor(y[vl_mask])
X_te = torch.FloatTensor(X_s[te_mask])
y_te = torch.FloatTensor(y[te_mask])

model = nn.Sequential(
    nn.Linear(3, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(32, 1)
)

opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
crit = nn.MSELoss()

best_vl = float('inf')
best_st = None
for ep in range(200):
    model.train()
    loss = crit(model(X_tr), y_tr)
    opt.zero_grad(); loss.backward(); opt.step()
    model.eval()
    with torch.no_grad():
        vl = crit(model(X_vl), y_vl).item()
    if vl < best_vl:
        best_vl = vl; best_st = copy.deepcopy(model.state_dict())

model.load_state_dict(best_st)
model.eval()
with torch.no_grad():
    pred = model(X_te).numpy()
    r2 = 1 - np.sum((y[te_mask] - pred)**2) / np.sum((y[te_mask] - y[te_mask].mean())**2)
    rmse = np.sqrt(np.mean((y[te_mask] - pred)**2))
print(f"Test R2: {r2:.4f}")
print(f"Test RMSE: {rmse:.2f}")
end
# R: Complete neural network pipeline
library(torch)

# ---- 1. Data preparation ----
data("BostonHousing", package = "mlbench")
df <- BostonHousing

set.seed(42)
n <- nrow(df)
idx <- sample(n)
train_end <- floor(0.7 * n)
val_end   <- floor(0.85 * n)

train_df <- df[idx[1:train_end], ]
val_df   <- df[idx[(train_end+1):val_end], ]
test_df  <- df[idx[(val_end+1):n], ]

# Scale
x_tr_raw <- as.matrix(train_df[, -14])
x_tr <- scale(x_tr_raw)
ctr <- attr(x_tr, "scaled:center")
scl <- attr(x_tr, "scaled:scale")
x_vl <- scale(as.matrix(val_df[, -14]), center = ctr, scale = scl)
x_te <- scale(as.matrix(test_df[, -14]), center = ctr, scale = scl)

# Tensors
xt <- torch_tensor(x_tr, dtype = torch_float())
yt <- torch_tensor(as.matrix(train_df$medv), dtype = torch_float())
xv <- torch_tensor(x_vl, dtype = torch_float())
yv <- torch_tensor(as.matrix(val_df$medv), dtype = torch_float())
xe <- torch_tensor(x_te, dtype = torch_float())

# ---- 2. Model ----
net <- nn_sequential(
  nn_linear(13, 256), nn_batch_norm1d(256), nn_relu(), nn_dropout(0.3),
  nn_linear(256, 128), nn_batch_norm1d(128), nn_relu(), nn_dropout(0.3),
  nn_linear(128, 64), nn_relu(), nn_dropout(0.2),
  nn_linear(64, 1)
)

# ---- 3. Train ----
opt <- optim_adam(net$parameters, lr = 0.001, weight_decay = 1e-4)
crit <- nn_mse_loss()
best_val <- Inf
best_sd  <- NULL

for (ep in 1:200) {
  net$train()
  loss <- crit(net(xt), yt)
  opt$zero_grad(); loss$backward(); opt$step()
  net$eval()
  vl <- crit(net(xv), yv)$item()
  if (vl < best_val) {
    best_val <- vl
    # Checkpoint the best weights (mirrors the state_dict copy in the Python version)
    best_sd <- lapply(net$state_dict(), function(p) p$clone())
  }
  if (ep %% 50 == 0) cat(sprintf("Epoch %3d | Val: %.4f\n", ep, vl))
}

# ---- 4. Evaluate ----
net$load_state_dict(best_sd)  # restore the best checkpoint before testing
net$eval()
pred <- as.numeric(with_no_grad(net(xe)))
actual <- test_df$medv
r2 <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
rmse <- sqrt(mean((actual - pred)^2))
mae <- mean(abs(actual - pred))

cat("\n=== Final Test Evaluation ===\n")
cat("R-squared:", round(r2, 4), "\n")
cat("RMSE:     ", round(rmse, 4), "\n")
cat("MAE:      ", round(mae, 4), "\n")
Python Output
Early stopping at epoch 142

=== Final Evaluation on Test Set ===
R-squared:          0.8134
RMSE:               0.5198
MAE:                0.3612
Stata Output
Test R2: 0.7856
Test RMSE: 1932.45
R Output
Epoch  50 | Val: 14.2345
Epoch 100 | Val: 9.4567
Epoch 150 | Val: 8.1234
Epoch 200 | Val: 7.8901

=== Final Test Evaluation ===
R-squared: 0.8456
RMSE:      3.4521
MAE:       2.1234

When to Use Neural Networks in Economics

For the types of data most common in economics — structured, tabular datasets with tens to hundreds of features and thousands to hundreds of thousands of observations — neural networks are usually not the best choice. Tree-based methods (Random Forests, XGBoost) consistently match or outperform neural networks on tabular data, are easier to tune, and require less preprocessing. Large-scale benchmark studies (e.g., Grinsztajn, Oyallon, and Varoquaux, 2022) have repeatedly confirmed this finding. If your data fits in a spreadsheet, start with XGBoost.
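That advice is cheap to follow in practice. As a minimal sketch of what a tabular baseline comparison looks like — using synthetic nonlinear data and scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost, so every name and number here is illustrative rather than part of this chapter's pipeline:

```python
# Illustrative tabular baseline: gradient-boosted trees vs. a linear model
# on synthetic data with a purely nonlinear signal (not the chapter's dataset).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
# Ground truth uses a squared term and an interaction a linear model cannot represent
y = X[:, 0] ** 2 + 2 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lin = LinearRegression().fit(X_tr, y_tr)
gbt = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

print(f"Linear R2:        {lin.score(X_te, y_te):.3f}")  # near zero: no linear signal
print(f"Boosted trees R2: {gbt.score(X_te, y_te):.3f}")  # trees capture the nonlinearity
```

A baseline like this takes seconds to fit and tells you how much headroom, if any, a neural network would have to earn on your tabular problem.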

Where neural networks truly shine is on unstructured data: text (natural language processing), images (computer vision), audio, and other high-dimensional, spatially or sequentially structured inputs. If your economics research involves analyzing text from central bank communications, classifying satellite images to measure economic activity, or processing survey audio recordings, neural networks are the right tool. They are also valuable when you have very large datasets (millions of observations) where the additional model capacity pays off, and when you need to learn representations that transfer across tasks (transfer learning).

A practical concern for policy-oriented economics research is interpretability. Regulators, policymakers, and journal referees often want to understand why a model makes certain predictions. Neural networks are inherently opaque: model-agnostic tools such as SHAP values do work on them, but they are slower to compute and yield less transparent explanations than a linear model or even a decision tree provides. If your goal is policy-relevant prediction (for example, predicting recidivism to inform bail decisions), the interpretability costs of neural networks may outweigh their marginal predictive gains over simpler models.
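One lightweight, model-agnostic alternative to SHAP is permutation importance: shuffle one feature at a time and measure how much prediction error rises. It gives only a coarse, approximate ranking, but it works with any fitted model. A minimal sketch — the `permutation_importance` helper and the toy prediction function below are illustrative, not part of this chapter's pipeline:

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average rise in MSE when each column is shuffled; higher = more important."""
    rng = np.random.default_rng(seed)
    base = np.mean((y - predict(X)) ** 2)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break the link between feature j and the outcome
            imp[j] += np.mean((y - predict(Xp)) ** 2) - base
    return imp / n_repeats

# Toy check: feature 0 drives the outcome, feature 1 is pure noise
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)
imp = permutation_importance(lambda X: 3 * X[:, 0], X, y)
print(imp)  # importance of feature 0 dwarfs feature 1
```

A trained PyTorch network plugs in the same way, e.g. `predict = lambda X: model(torch.tensor(X, dtype=torch.float32)).detach().numpy().ravel()`.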

When to Use Neural Networks vs. Other Methods

  • Use neural networks for: text data (NLP), image data (computer vision), very large datasets (millions of rows), tasks where transfer learning is beneficial, or when you need to learn complex representations from raw data.
  • Use tree-based methods (XGBoost, Random Forest) for: standard tabular data, moderate-size datasets, when you need variable importance, or when ease of tuning matters.
  • Use regularized regression (LASSO, Ridge) for: inference-adjacent prediction tasks, small samples, when interpretability is paramount, or when the true relationship is approximately linear.

Limitations

Key Limitations of Neural Networks

  • Data hungry: Neural networks require large amounts of data to train well. With small samples (a few hundred or even a few thousand observations, typical for many economic panel datasets), neural networks are prone to overfitting and usually underperform simpler models like LASSO or Random Forests.
  • Computationally expensive: Training a neural network involves iterating over the data many times (epochs), computing gradients for thousands or millions of parameters. This is much slower than fitting an OLS regression or even training an XGBoost model. GPU hardware can help, but adds cost and complexity.
  • Hard to interpret: Neural networks are black boxes. While SHAP values, attention weights, and other interpretation techniques exist, they provide approximations rather than the clean, transparent explanations that a linear model or single decision tree offers. This is a serious limitation for policy applications.
  • Sensitive to hyperparameters: The number of layers, number of neurons, learning rate, dropout rate, batch size, and other settings all affect performance significantly. There is no reliable theory to guide these choices — you must rely on cross-validation and experimentation, which is time-consuming.
  • Feature scaling is required: Unlike tree-based methods, neural networks require that features be standardized or normalized. Forgetting this step can cause training to fail silently — the model may appear to train but produce poor results.
  • Overfitting on small data: A network with 10,000 parameters trained on 500 observations has vastly more capacity than the data can discipline: it can memorize the training set perfectly and generalize terribly. Regularization helps, but cannot fully compensate for insufficient data.
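That last point is easy to make concrete with the architecture from this section's R example (13 → 256 → 128 → 64 → 1, with batch norm after the first two hidden layers). Against roughly 354 training rows (a 70% split of the 506-row Boston sample), the parameter-to-observation ratio is over 100:1. A quick count, in plain Python with no torch required:

```python
# Count trainable parameters in the 13 -> 256 -> 128 -> 64 -> 1 network above.
sizes = [13, 256, 128, 64, 1]

# Each linear layer: (n_in * n_out) weights + n_out biases
linear = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# Each batch-norm layer: one gamma and one beta per unit
batchnorm = sum(2 * n for n in (256, 128))

total = linear + batchnorm
print(f"Linear params:    {linear:,}")     # 44,801
print(f"BatchNorm params: {batchnorm:,}")  # 768
print(f"Total:            {total:,}")      # 45,569 -- vs. ~354 training rows
```

In PyTorch the same number comes from `sum(p.numel() for p in model.parameters())` (batch-norm running statistics are buffers, not parameters, so they are excluded either way).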

References

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. (Available free at deeplearningbook.org)
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
  • Mullainathan, S., & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2), 87-106.
  • Ng, A. Stanford CS229: Machine Learning. Lecture Notes on Neural Networks.