11C  Neural Networks

~2.5 hours · Architecture, Training, Regularization, Implementation

Learning Objectives

  • Understand how neural networks extend linear models with nonlinear activation functions
  • Build a feedforward neural network from scratch in Python (PyTorch)
  • Train networks using backpropagation and gradient descent
  • Apply regularization techniques (dropout, early stopping) to prevent overfitting
  • Know when neural networks are (and aren't) the right choice for economics research

From Linear Models to Neural Networks

You already know linear regression: the predicted outcome is a weighted sum of the features, ŷ = Xβ. This is powerful and interpretable, but it can only learn linear relationships. If the true relationship between your features and the outcome is nonlinear (for example, the effect of education on earnings is different at different levels of education), a linear model will miss it unless you manually engineer the right transformations — adding squared terms, interactions, polynomials, and so on.

A neural network automates this process. The key idea is to stack multiple linear transformations with nonlinear activation functions between them. In the simplest terms: take your features X, multiply by a weight matrix W1, add a bias b1, then apply a nonlinear function (like ReLU, which just replaces negative values with zero). The result is a new set of "features" — called a hidden layer. Then take those hidden features, multiply by another weight matrix W2, add another bias b2, and you get the output. By chaining together several of these layers, the network can learn arbitrarily complex nonlinear functions of the input features.
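The mechanics described above can be sketched in a few lines of numpy (the shapes and random values here are arbitrary, purely for illustration):

```python
import numpy as np

# One forward pass through a single hidden layer:
# 2 observations, 3 features, 4 hidden units (made-up sizes).
rng = np.random.default_rng(42)
X = rng.normal(size=(2, 3))           # features
W1 = rng.normal(size=(3, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

hidden = np.maximum(0, X @ W1 + b1)   # linear transform, then ReLU
output = hidden @ W2 + b2             # second linear transform
print(output.shape)                   # (2, 1): one prediction per observation
```

The `hidden` array is exactly the "new set of features" the text describes; stacking more W/b pairs with ReLUs in between gives a deeper network.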

This flexibility has formal backing: the Universal Approximation Theorem. It states that a neural network with just one hidden layer and enough neurons can approximate any continuous function (on a bounded input region) to any desired degree of accuracy. In other words, neural networks are flexible enough to represent virtually any pattern in the data — if you have enough data and computational resources. The theorem only guarantees that suitable weights exist, however; the practical challenge is finding them, which is where training algorithms come in.
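A small numerical illustration of this flexibility: fix one hidden layer of random ReLU features and fit only the output weights by least squares. Even this crude procedure tracks a nonlinear target closely. (The target sin(3x) and the 50 hidden units are arbitrary choices for the demo.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200).reshape(-1, 1)
y = np.sin(3 * x)                          # nonlinear target to approximate

W1 = rng.normal(size=(1, 50)); b1 = rng.normal(size=50)
H = np.maximum(0, x @ W1 + b1)             # one hidden layer of random ReLU features
H = np.column_stack([H, np.ones(len(x))])  # plus an intercept column

# Fit only the output-layer weights by ordinary least squares
w2, *_ = np.linalg.lstsq(H, y, rcond=None)

max_err = np.max(np.abs(H @ w2 - y))
print(f"max absolute error: {max_err:.4f}")
```

Training a real network also adjusts W1 and b1, which is why gradient-based training (covered below) can do even better with far fewer units.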

Think of it this way: a linear model draws a single straight line (or plane) through your data. A neural network draws a wiggly, curved surface that can follow the contours of the data as closely as needed. Each hidden layer adds another level of abstraction — the first layer might learn simple features (like "is this value high or low?"), the second layer might combine those into more complex features (like "is this a pattern of high income AND low education?"), and so on. This hierarchical feature learning is what makes neural networks so powerful for complex data.

A Single Neuron (Perceptron)

[Diagram: inputs x₁, x₂, x₃ are multiplied by weights w₁, w₂, w₃, summed (Σ), and passed through an activation σ(z) to produce the output ŷ]

Output = σ(w₁x₁ + w₂x₂ + w₃x₃ + b), where σ is a nonlinear activation function

Network Architecture

A feedforward neural network is organized into layers. The input layer receives the features (one neuron per feature). One or more hidden layers perform the transformations. The output layer produces the final prediction. Each neuron in a layer receives inputs from every neuron in the previous layer (this is why these are also called "fully connected" or "dense" layers), applies a linear transformation (weighted sum + bias), and passes the result through an activation function.

The choice of activation function is crucial. Without it, stacking layers would be pointless — a composition of linear functions is just another linear function. The activation function introduces the nonlinearity that gives neural networks their power. The most common choices are:

Function      | Formula                         | Range        | When to Use
--------------|---------------------------------|--------------|-----------------------------------------------
ReLU          | max(0, z)                       | [0, +inf)    | Default for hidden layers. Fast, simple, works well in practice.
Sigmoid       | 1 / (1 + e^(-z))                | (0, 1)       | Output layer for binary classification (probability output).
Tanh          | (e^z - e^(-z)) / (e^z + e^(-z)) | (-1, 1)      | Hidden layers when centered output is desired. Often works better than sigmoid.
Linear (none) | z                               | (-inf, +inf) | Output layer for regression (continuous prediction).
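The nonlinear activations from the table, written out directly in numpy:

```python
import numpy as np

def relu(z):    return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))
def tanh(z):    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))     # negatives clipped to 0
print(sigmoid(z))  # squashed into (0, 1); sigmoid(0) = 0.5
print(tanh(z))     # squashed into (-1, 1), centered at 0
```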

Two key design choices are the width (number of neurons per layer) and the depth (number of layers). Wider layers can learn more complex patterns within a single transformation, while deeper networks can learn hierarchical representations — simple patterns in early layers composed into complex patterns in later layers. In practice, modern networks tend to favor depth over width. A common starting point for tabular economic data is 2-3 hidden layers with 64-256 neurons each.

A subtle but important point: adding more layers and neurons always increases the model's capacity (the set of functions it can represent), but it also increases the risk of overfitting. This is why regularization (covered below) is essential when training neural networks.

import torch
import torch.nn as nn

# Define a feedforward neural network with PyTorch
# Architecture: 8 inputs -> 128 -> 64 -> 32 -> 1 output

model = nn.Sequential(
    # First hidden layer
    nn.Linear(8, 128),
    nn.ReLU(),         # Activation: max(0, z)

    # Second hidden layer
    nn.Linear(128, 64),
    nn.ReLU(),

    # Third hidden layer
    nn.Linear(64, 32),
    nn.ReLU(),

    # Output layer (linear activation for regression)
    nn.Linear(32, 1)
)

# Print the architecture summary
print(model)

# Count total trainable parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal trainable parameters: {total_params:,}")

# Breakdown: each Linear(in, out) has in*out weights + out biases
for name, param in model.named_parameters():
    print(f"  {name:15s}  shape: {str(list(param.shape)):15s}  params: {param.numel():,}")
* Stata: Neural network definition via Python block
* Stata does not have native neural network support.
* Use the python: block to define and train models.

python:
import torch
import torch.nn as nn

# Define a simple feedforward network
# For Stata's auto dataset: 3 features -> predict price
model = nn.Sequential(
    nn.Linear(3, 64),   # 3 features (mpg, weight, length)
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1)    # 1 continuous output (price)
)

print(model)
total = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total:,}")
end
# R: Neural network definition with torch (R interface to LibTorch)
library(torch)

# Define network architecture using nn_sequential
model <- nn_sequential(
  # First hidden layer: 13 inputs (Boston Housing features)
  nn_linear(13, 128),
  nn_relu(),

  # Second hidden layer
  nn_linear(128, 64),
  nn_relu(),

  # Third hidden layer
  nn_linear(64, 32),
  nn_relu(),

  # Output layer (regression)
  nn_linear(32, 1)
)

# Print architecture
print(model)

# Count parameters
n_params <- sum(sapply(model$parameters, function(p) p$numel()))
cat("Total trainable parameters:", format(n_params, big.mark = ","), "\n")
Python Output
Sequential(
  (0): Linear(in_features=8, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=64, bias=True)
  (3): ReLU()
  (4): Linear(in_features=64, out_features=32, bias=True)
  (5): ReLU()
  (6): Linear(in_features=32, out_features=1, bias=True)
)

Total trainable parameters: 11,521
  0.weight          shape: [128, 8]        params: 1,024
  0.bias            shape: [128]           params: 128
  2.weight          shape: [64, 128]       params: 8,192
  2.bias            shape: [64]            params: 64
  4.weight          shape: [32, 64]        params: 2,048
  4.bias            shape: [32]            params: 32
  6.weight          shape: [1, 32]         params: 32
  6.bias            shape: [1]             params: 1
Stata Output
Sequential(
  (0): Linear(in_features=3, out_features=64, bias=True)
  (1): ReLU()
  (2): Linear(in_features=64, out_features=32, bias=True)
  (3): ReLU()
  (4): Linear(in_features=32, out_features=1, bias=True)
)

Total parameters: 2,369
R Output
An `nn_sequential` module with 7 children:
 (0): nn_linear (13 -> 128)
 (1): nn_relu
 (2): nn_linear (128 -> 64)
 (3): nn_relu
 (4): nn_linear (64 -> 32)
 (5): nn_relu
 (6): nn_linear (32 -> 1)

Total trainable parameters: 12,161

Training Neural Networks

Training a neural network means finding the values of all the weights and biases that minimize a loss function. For regression, the standard loss is the mean squared error (MSE): the average of the squared differences between predicted and actual values. For binary classification, we use binary cross-entropy loss. The choice of loss function is the same as choosing a likelihood in statistics — it defines what "good fit" means.
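Both losses are easy to verify on toy numbers (values chosen arbitrarily for the check):

```python
import torch

# Mean squared error for regression
mse = torch.nn.MSELoss()
pred   = torch.tensor([2.0, 4.0])
target = torch.tensor([1.0, 5.0])
print(mse(pred, target).item())   # ((2-1)^2 + (4-5)^2) / 2 = 1.0

# Binary cross-entropy for classification (inputs are probabilities)
bce = torch.nn.BCELoss()
probs  = torch.tensor([0.9, 0.2])  # predicted P(y = 1)
labels = torch.tensor([1.0, 0.0])  # actual classes
print(bce(probs, labels).item())   # -(log 0.9 + log 0.8) / 2 ≈ 0.164
```

BCE is exactly the negative average Bernoulli log-likelihood, which is the sense in which choosing a loss mirrors choosing a likelihood.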

The workhorse algorithm for training is backpropagation, which is simply the chain rule of calculus applied systematically through the network. Here is the intuition: after making a prediction, we compute the loss. Then we ask, "How would changing each weight slightly change the loss?" This is the gradient — the direction of steepest increase in the loss. We then update each weight in the opposite direction (downhill) to reduce the loss. Backpropagation computes all these gradients efficiently by working backward through the layers, reusing calculations from later layers to speed up the computation for earlier ones.
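One way to make "how would changing each weight slightly change the loss" concrete is to check the backpropagated gradient against a finite-difference estimate on a tiny model (sizes here are arbitrary):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
x = torch.randn(5, 3)
y = torch.randn(5, 1)
criterion = torch.nn.MSELoss()

loss = criterion(model(x), y)
loss.backward()                             # backpropagation
autograd_g = model.weight.grad[0, 0].item() # gradient for one weight

# Finite difference: nudge the same weight and re-measure the loss
eps = 1e-2
with torch.no_grad():
    model.weight[0, 0] += eps
    loss_plus = criterion(model(x), y).item()
    model.weight[0, 0] -= 2 * eps
    loss_minus = criterion(model(x), y).item()
numeric_g = (loss_plus - loss_minus) / (2 * eps)

print(autograd_g, numeric_g)  # the two estimates agree closely
```

Backpropagation gets the same answer as the nudge-and-remeasure experiment, but computes every weight's gradient in a single backward pass.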

Once we have the gradients, we use gradient descent to update the weights. In practice, we almost always use Stochastic Gradient Descent (SGD) or one of its variants. Instead of computing the gradient on the entire dataset (which is slow), SGD computes the gradient on a small random subset called a mini-batch (typically 32 to 256 observations). This is noisy but fast, and the noise actually helps the model escape shallow local minima and find better solutions.

The most popular gradient descent variant today is Adam (Adaptive Moment Estimation). Adam maintains a separate, adaptive learning rate for each weight, so you do not need to tune the learning rate as carefully. It is the default optimizer in most practical applications.

The learning rate is the single most important hyperparameter. Too large and the model overshoots the optimum, bouncing around chaotically. Too small and training takes forever and may get stuck. A common strategy is to start with a moderate learning rate (0.001 for Adam) and use a learning rate scheduler that reduces it as training progresses.
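A minimal sketch of a scheduler, using StepLR (one simple built-in choice; the step size and decay factor are arbitrary here):

```python
import torch

p = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([p], lr=0.001)
# Halve the learning rate every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    optimizer.step()   # (the usual gradient update would happen here)
    scheduler.step()   # decay the learning rate on schedule

print(optimizer.param_groups[0]["lr"])  # 0.001 * 0.5**3 = 0.000125
```

ReduceLROnPlateau is another common choice: instead of a fixed schedule, it cuts the learning rate when a monitored metric (e.g., validation loss) stops improving.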

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load and prepare data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# CRITICAL: Scale features for neural networks
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)  # Use train statistics!

# Convert to PyTorch tensors
X_tr = torch.FloatTensor(X_train_scaled)
y_tr = torch.FloatTensor(y_train).unsqueeze(1)
X_te = torch.FloatTensor(X_test_scaled)
y_te = torch.FloatTensor(y_test).unsqueeze(1)

# Create DataLoader for mini-batch training
train_ds = TensorDataset(X_tr, y_tr)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

# Define model
model = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32),  nn.ReLU(),
    nn.Linear(32, 1)
)

# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
n_epochs = 100
for epoch in range(n_epochs):
    model.train()  # Set model to training mode
    epoch_loss = 0.0

    for X_batch, y_batch in train_loader:
        # Forward pass: compute predictions
        predictions = model(X_batch)
        loss = criterion(predictions, y_batch)

        # Backward pass: compute gradients
        optimizer.zero_grad()  # Reset gradients from previous step
        loss.backward()         # Backpropagation
        optimizer.step()          # Update weights

        epoch_loss += loss.item() * X_batch.size(0)

    # Evaluate on test set every 20 epochs
    if (epoch + 1) % 20 == 0:
        model.eval()  # Set model to evaluation mode
        with torch.no_grad():
            test_pred = model(X_te)
            test_loss = criterion(test_pred, y_te)
        train_loss = epoch_loss / len(X_tr)
        print(f"Epoch {epoch+1:3d} | Train Loss: {train_loss:.4f} | Test Loss: {test_loss:.4f}")

# Final R-squared
model.eval()
with torch.no_grad():
    y_pred = model(X_te).numpy()
ss_res = np.sum((y_test.reshape(-1,1) - y_pred)**2)
ss_tot = np.sum((y_test - y_test.mean())**2)
print(f"\nFinal Test R-squared: {1 - ss_res/ss_tot:.4f}")
* Stata: Neural network training via Python block

sysuse auto, clear
set seed 42
gen rand = runiform()
gen byte train = (rand <= 0.8)

python:
import torch
import torch.nn as nn
import numpy as np
from sfi import Data

# Get data from Stata
price  = np.array(Data.get("price"), dtype=np.float32)
mpg    = np.array(Data.get("mpg"), dtype=np.float32)
weight = np.array(Data.get("weight"), dtype=np.float32)
length = np.array(Data.get("length"), dtype=np.float32)
tr     = np.array(Data.get("train"))

X = np.column_stack([mpg, weight, length])
y = price.reshape(-1, 1)
mask = (tr == 1)

# Scale features
X_mean, X_std = X[mask].mean(0), X[mask].std(0)
X = (X - X_mean) / X_std
y_mean, y_std = y[mask].mean(), y[mask].std()
y = (y - y_mean) / y_std

X_tr = torch.FloatTensor(X[mask])
y_tr = torch.FloatTensor(y[mask])
X_te = torch.FloatTensor(X[~mask])
y_te = torch.FloatTensor(y[~mask])

model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1)
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

for epoch in range(100):
    pred = model(X_tr)
    loss = criterion(pred, y_tr)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 25 == 0:
        with torch.no_grad():
            test_loss = criterion(model(X_te), y_te)
        print(f"Epoch {epoch+1:3d} | Train: {loss:.4f} | Test: {test_loss:.4f}")
end
# R: Neural network training with torch + luz
library(torch)
library(luz)

# Prepare data
data("BostonHousing", package = "mlbench")
df <- BostonHousing

set.seed(42)
idx <- sample(nrow(df), floor(0.8 * nrow(df)))
train_df <- df[idx, ]
test_df  <- df[-idx, ]

# Scale features
x_train <- scale(as.matrix(train_df[, -14]))
x_test  <- scale(as.matrix(test_df[, -14]),
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))
y_train <- as.matrix(train_df$medv)
y_test  <- as.matrix(test_df$medv)

# Convert to tensors
x_tr <- torch_tensor(x_train, dtype = torch_float())
y_tr <- torch_tensor(y_train, dtype = torch_float())
x_te <- torch_tensor(x_test, dtype = torch_float())
y_te <- torch_tensor(y_test, dtype = torch_float())

# Define model
net <- nn_sequential(
  nn_linear(13, 128), nn_relu(),
  nn_linear(128, 64), nn_relu(),
  nn_linear(64, 32),  nn_relu(),
  nn_linear(32, 1)
)

# Training loop
optimizer <- optim_adam(net$parameters, lr = 0.001)
criterion <- nn_mse_loss()

for (epoch in 1:100) {
  net$train()
  pred <- net(x_tr)
  loss <- criterion(pred, y_tr)

  optimizer$zero_grad()
  loss$backward()
  optimizer$step()

  if (epoch %% 20 == 0) {
    net$eval()
    test_pred <- net(x_te)
    test_loss <- criterion(test_pred, y_te)
    cat(sprintf("Epoch %3d | Train: %.4f | Test: %.4f\n",
                epoch, loss$item(), test_loss$item()))
  }
}
Python Output
Epoch  20 | Train Loss: 0.4821 | Test Loss: 0.4523
Epoch  40 | Train Loss: 0.3156 | Test Loss: 0.3342
Epoch  60 | Train Loss: 0.2587 | Test Loss: 0.2901
Epoch  80 | Train Loss: 0.2234 | Test Loss: 0.2756
Epoch 100 | Train Loss: 0.2012 | Test Loss: 0.2698

Final Test R-squared: 0.7934
Stata Output
Epoch  25 | Train: 0.5632 | Test: 0.6123
Epoch  50 | Train: 0.3245 | Test: 0.4012
Epoch  75 | Train: 0.2187 | Test: 0.3456
Epoch 100 | Train: 0.1823 | Test: 0.3234
R Output
Epoch  20 | Train: 24.3412 | Test: 26.1234
Epoch  40 | Train: 12.5678 | Test: 15.8901
Epoch  60 | Train:  8.2345 | Test: 12.3456
Epoch  80 | Train:  6.1234 | Test: 10.8901
Epoch 100 | Train:  4.8901 | Test:  9.7654

Regularization in Neural Networks

Neural networks are powerful function approximators, but that power comes with a cost: they can easily memorize the training data, producing models that perform brilliantly on the training set but poorly on new data. This is especially problematic in economics, where sample sizes are often small relative to the complexity of the model. Regularization techniques constrain the model's capacity and help it generalize. Here are the most important ones.

Dropout is the most widely used regularization technique specific to neural networks. During each training step, each neuron in a dropout layer has a probability p (typically 0.2 to 0.5) of being temporarily "dropped" — its output is set to zero. This prevents the network from relying too heavily on any single neuron and forces it to develop redundant representations. At test time, all neurons are active. In the original formulation their outputs are scaled by (1 - p) to compensate; modern frameworks such as PyTorch instead use "inverted dropout," scaling the surviving activations up by 1/(1 - p) during training, so no adjustment is needed at test time. Dropout can be thought of as training an ensemble of many different sub-networks and averaging their predictions.
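You can verify PyTorch's dropout behavior directly: in training mode the surviving entries are scaled up by 1/(1 - p), and in eval mode the layer is a no-op.

```python
import torch

torch.manual_seed(0)
drop = torch.nn.Dropout(p=0.5)
x = torch.ones(10000)

drop.train()                   # training mode: dropout is active
out = drop(x)
print(out.unique().tolist())   # [0.0, 2.0]: dropped, or scaled by 1/(1-0.5)
print(out.mean().item())       # close to 1.0: expected value is preserved

drop.eval()                    # eval mode: identity
print(torch.equal(drop(x), x)) # True
```

This is why the training loops in this section call model.train() before updating and model.eval() before computing validation or test predictions.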

Early stopping is the simplest and often the most effective regularization strategy. The idea is to monitor the model's performance on a held-out validation set during training. Initially, both training and validation loss decrease. At some point, the training loss continues to decrease while the validation loss starts to increase — this is the moment the model begins to overfit. Early stopping halts training at (or near) this point and restores the weights from the epoch with the lowest validation loss.

Batch normalization normalizes the inputs to each layer, making training faster and more stable. While originally introduced to speed up training rather than as a regularization technique, batch normalization has a mild regularizing effect because the normalization introduces noise through the mini-batch statistics.

Weight decay (L2 regularization) penalizes large weight values, just like Ridge regression. In neural network frameworks, weight decay is typically implemented directly in the optimizer rather than as an explicit penalty term in the loss function. A small weight decay value (e.g., 1e-4 to 1e-2) is almost always beneficial.
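A toy demonstration of the optimizer-level penalty (the zero data loss and the deliberately large weight_decay are contrived so the decay is visible in isolation):

```python
import torch

p = torch.nn.Parameter(torch.tensor([1.0]))
optimizer = torch.optim.Adam([p], lr=0.01, weight_decay=0.1)

for _ in range(100):
    optimizer.zero_grad()
    loss = (0.0 * p).sum()  # data loss contributes zero gradient
    loss.backward()
    optimizer.step()        # only the weight-decay term acts on p

print(p.item())  # pulled well below the starting value of 1.0, toward 0
```

With a real loss, the same mechanism continuously nudges all weights toward zero while the data gradient pushes back, exactly as in Ridge regression.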

import torch
import torch.nn as nn
import copy

# Model with dropout and batch normalization
model_reg = nn.Sequential(
    nn.Linear(8, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.3),

    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(0.3),

    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(0.2),

    nn.Linear(32, 1)
)

# Optimizer with weight decay (L2 regularization)
optimizer = torch.optim.Adam(
    model_reg.parameters(),
    lr=0.001,
    weight_decay=1e-4
)
criterion = nn.MSELoss()

# Training with early stopping
best_val_loss = float('inf')
best_model_state = None
patience = 15
patience_counter = 0

for epoch in range(200):
    # Training
    model_reg.train()
    for X_batch, y_batch in train_loader:
        pred = model_reg(X_batch)
        loss = criterion(pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation
    model_reg.eval()
    with torch.no_grad():
        val_loss = criterion(model_reg(X_te), y_te).item()

    # Early stopping check
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model_state = copy.deepcopy(model_reg.state_dict())
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch+1}")
            break

    if (epoch + 1) % 25 == 0:
        print(f"Epoch {epoch+1:3d} | Val Loss: {val_loss:.4f} | Best: {best_val_loss:.4f}")

# Restore best model
model_reg.load_state_dict(best_model_state)
model_reg.eval()
with torch.no_grad():
    y_pred = model_reg(X_te).numpy()
ss_res = np.sum((y_test.reshape(-1,1) - y_pred)**2)
ss_tot = np.sum((y_test - y_test.mean())**2)
print(f"\nRegularized Model Test R-squared: {1 - ss_res/ss_tot:.4f}")
* Stata: Regularized neural network via Python block

python:
import torch
import torch.nn as nn
import copy
import numpy as np
from sfi import Data

# Get and prepare data (reusing from earlier)
price  = np.array(Data.get("price"), dtype=np.float32)
mpg    = np.array(Data.get("mpg"), dtype=np.float32)
weight = np.array(Data.get("weight"), dtype=np.float32)
length = np.array(Data.get("length"), dtype=np.float32)
tr     = np.array(Data.get("train"))

X = np.column_stack([mpg, weight, length])
y = price.reshape(-1, 1)
mask = (tr == 1)

# Scale
X_mean, X_std = X[mask].mean(0), X[mask].std(0)
X_s = (X - X_mean) / X_std
y_mean, y_std = y[mask].mean(), y[mask].std()
y_s = (y - y_mean) / y_std

X_tr = torch.FloatTensor(X_s[mask])
y_tr = torch.FloatTensor(y_s[mask])
X_te = torch.FloatTensor(X_s[~mask])
y_te = torch.FloatTensor(y_s[~mask])

# Model with dropout
model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(32, 1)
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
criterion = nn.MSELoss()

best_loss = float('inf')
best_state = None

for epoch in range(200):
    model.train()
    loss = criterion(model(X_tr), y_tr)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        vl = criterion(model(X_te), y_te).item()
    if vl < best_loss:
        best_loss = vl
        best_state = copy.deepcopy(model.state_dict())
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1} | Val Loss: {vl:.4f}")

model.load_state_dict(best_state)
print(f"Best validation loss: {best_loss:.4f}")
end
# R: Regularized neural network with dropout and early stopping
library(torch)

# Model with dropout
net_reg <- nn_sequential(
  nn_linear(13, 128),
  nn_batch_norm1d(128),
  nn_relu(),
  nn_dropout(p = 0.3),

  nn_linear(128, 64),
  nn_batch_norm1d(64),
  nn_relu(),
  nn_dropout(p = 0.3),

  nn_linear(64, 32),
  nn_relu(),
  nn_dropout(p = 0.2),

  nn_linear(32, 1)
)

# Optimizer with weight decay
optimizer <- optim_adam(net_reg$parameters, lr = 0.001,
                       weight_decay = 1e-4)
criterion <- nn_mse_loss()

# Training with early stopping
best_val <- Inf
patience <- 15
wait <- 0

for (epoch in 1:200) {
  net_reg$train()
  pred <- net_reg(x_tr)
  loss <- criterion(pred, y_tr)
  optimizer$zero_grad()
  loss$backward()
  optimizer$step()

  net_reg$eval()
  val_loss <- criterion(net_reg(x_te), y_te)$item()

  if (val_loss < best_val) {
    best_val <- val_loss
    wait <- 0
  } else {
    wait <- wait + 1
    if (wait >= patience) {
      cat("Early stopping at epoch", epoch, "\n")
      break
    }
  }

  if (epoch %% 25 == 0) {
    cat(sprintf("Epoch %3d | Val: %.4f | Best: %.4f\n",
                epoch, val_loss, best_val))
  }
}
Python Output
Epoch  25 | Val Loss: 0.3892 | Best: 0.3756
Epoch  50 | Val Loss: 0.2845 | Best: 0.2701
Epoch  75 | Val Loss: 0.2634 | Best: 0.2589
Epoch 100 | Val Loss: 0.2612 | Best: 0.2534
Early stopping at epoch 118

Regularized Model Test R-squared: 0.8067
Stata Output
Epoch 50 | Val Loss: 0.4231
Epoch 100 | Val Loss: 0.3123
Epoch 150 | Val Loss: 0.2987
Best validation loss: 0.2845
R Output
Epoch  25 | Val: 18.4523 | Best: 17.8901
Epoch  50 | Val: 11.2345 | Best: 10.5678
Epoch  75 | Val: 9.1234 | Best: 8.7654
Epoch 100 | Val: 8.9012 | Best: 8.3456
Early stopping at epoch 112

Practical Implementation

Training a neural network involves several practical considerations that go beyond defining the architecture. The most important is feature scaling. Unlike tree-based methods, neural networks are sensitive to the scale of input features. If one feature ranges from 0 to 1 and another from 0 to 1,000,000, the gradients will be dominated by the large-scale feature, making training slow or unstable. Always standardize your features (subtract mean, divide by standard deviation) before feeding them to a neural network. Use the training set statistics for standardization and apply the same transformation to the test set.
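The scaling rule in miniature, with made-up numbers (note the second feature is on a much larger scale, the situation described above):

```python
import numpy as np

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test  = np.array([[2.0, 250.0]])

# Compute statistics on the TRAINING set only
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)

X_train_s = (X_train - mu) / sd   # each column now has mean 0, std 1
X_test_s  = (X_test - mu) / sd    # same transform: train statistics, NOT test statistics

print(X_train_s.mean(axis=0))     # ~[0, 0]
print(X_test_s)
```

Computing fresh statistics on the test set would leak information and make test predictions inconsistent with the transformation the network was trained on.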

Batch size controls the trade-off between computation speed and gradient quality. Larger batches (256, 512) produce more stable gradient estimates and train faster on GPUs, but may converge to sharper minima that generalize worse. Smaller batches (32, 64) introduce more noise in the gradients, which acts as implicit regularization and often leads to better generalization. A batch size of 32 or 64 is a good starting point for most tabular datasets in economics.

The number of epochs (complete passes through the training data) depends on the problem and the learning rate. With early stopping, you can set a large maximum number of epochs (say, 500) and let the early stopping criterion determine when to stop. Monitor both training and validation loss curves: if the training loss is much lower than the validation loss, the model is overfitting; if both are high, the model is underfitting (try a larger network or more epochs).

For large datasets or complex models, GPU acceleration can speed up training dramatically. PyTorch makes this easy: move your model and data to the GPU with .to('cuda'), and everything else stays the same. For the tabular datasets typical in economics (thousands to millions of rows, tens to hundreds of features), CPU training is usually fast enough.
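The device-agnostic pattern looks like this (toy model and data; the code runs unchanged on CPU when no GPU is present):

```python
import torch

# Pick the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(8, 1).to(device)  # move parameters to the device
x = torch.randn(4, 8).to(device)          # move data to the same device

pred = model(x)                           # computed on whichever device was chosen
print(pred.shape, pred.device)
```

The only rule is that the model and the tensors it consumes must live on the same device; everything else in the training loop stays the same.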

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import copy

# ============ 1. DATA PREPARATION ============
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Split: 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s   = scaler.transform(X_val)
X_test_s  = scaler.transform(X_test)

# Convert to tensors
X_tr = torch.FloatTensor(X_train_s)
y_tr = torch.FloatTensor(y_train.values).unsqueeze(1)
X_vl = torch.FloatTensor(X_val_s)
y_vl = torch.FloatTensor(y_val.values).unsqueeze(1)
X_te = torch.FloatTensor(X_test_s)
y_te = torch.FloatTensor(y_test.values).unsqueeze(1)

train_loader = DataLoader(
    TensorDataset(X_tr, y_tr), batch_size=64, shuffle=True
)

# ============ 2. MODEL DEFINITION ============
model = nn.Sequential(
    nn.Linear(8, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(64, 1)
)

# ============ 3. TRAINING ============
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=10
)
criterion = nn.MSELoss()

best_val_loss = float('inf')
best_state = None
patience, wait = 25, 0

for epoch in range(300):
    model.train()
    for xb, yb in train_loader:
        loss = criterion(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_vl), y_vl).item()
    scheduler.step(val_loss)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            print(f"Early stopping at epoch {epoch+1}")
            break

# ============ 4. EVALUATION ============
model.load_state_dict(best_state)
model.eval()
with torch.no_grad():
    y_pred = model(X_te).numpy().flatten()

print("=== Final Evaluation on Test Set ===")
print(f"R-squared:          {r2_score(y_test, y_pred):.4f}")
print(f"RMSE:               {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE:                {mean_absolute_error(y_test, y_pred):.4f}")
* Stata: Complete neural network pipeline via Python

sysuse auto, clear
set seed 42
gen rand = runiform()
gen byte split = cond(rand <= 0.7, 1, cond(rand <= 0.85, 2, 3))
* split: 1=train, 2=validation, 3=test

python:
import torch, torch.nn as nn, copy
import numpy as np
from sfi import Data

# Load data
price  = np.array(Data.get("price"), dtype=np.float32)
mpg    = np.array(Data.get("mpg"), dtype=np.float32)
weight = np.array(Data.get("weight"), dtype=np.float32)
length = np.array(Data.get("length"), dtype=np.float32)
split  = np.array(Data.get("split"))

X = np.column_stack([mpg, weight, length])
y = price.reshape(-1, 1)
tr_mask = (split == 1); vl_mask = (split == 2); te_mask = (split == 3)

# Scale
mu, sd = X[tr_mask].mean(0), X[tr_mask].std(0)
X_s = (X - mu) / sd

X_tr = torch.FloatTensor(X_s[tr_mask])
y_tr = torch.FloatTensor(y[tr_mask])
X_vl = torch.FloatTensor(X_s[vl_mask])
y_vl = torch.FloatTensor(y[vl_mask])
X_te = torch.FloatTensor(X_s[te_mask])
y_te = torch.FloatTensor(y[te_mask])

model = nn.Sequential(
    nn.Linear(3, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(32, 1)
)

opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
crit = nn.MSELoss()

best_vl = float('inf')
best_st = None
for ep in range(200):
    model.train()
    loss = crit(model(X_tr), y_tr)
    opt.zero_grad(); loss.backward(); opt.step()
    model.eval()
    with torch.no_grad():
        vl = crit(model(X_vl), y_vl).item()
    if vl < best_vl:
        best_vl = vl; best_st = copy.deepcopy(model.state_dict())

model.load_state_dict(best_st)
model.eval()
with torch.no_grad():
    pred = model(X_te).numpy()
    r2 = 1 - np.sum((y[te_mask] - pred)**2) / np.sum((y[te_mask] - y[te_mask].mean())**2)
    rmse = np.sqrt(np.mean((y[te_mask] - pred)**2))
print(f"Test R2: {r2:.4f}")
print(f"Test RMSE: {rmse:.2f}")
end
# R: Complete neural network pipeline
library(torch)

# ---- 1. Data preparation ----
data("BostonHousing", package = "mlbench")
df <- BostonHousing

set.seed(42)
n <- nrow(df)
idx <- sample(n)
train_end <- floor(0.7 * n)
val_end   <- floor(0.85 * n)

train_df <- df[idx[1:train_end], ]
val_df   <- df[idx[(train_end+1):val_end], ]
test_df  <- df[idx[(val_end+1):n], ]

# Scale
x_tr_raw <- as.matrix(train_df[, -14])
x_tr <- scale(x_tr_raw)
ctr <- attr(x_tr, "scaled:center")
scl <- attr(x_tr, "scaled:scale")
x_vl <- scale(as.matrix(val_df[, -14]), center = ctr, scale = scl)
x_te <- scale(as.matrix(test_df[, -14]), center = ctr, scale = scl)

# Tensors
xt <- torch_tensor(x_tr, dtype = torch_float())
yt <- torch_tensor(as.matrix(train_df$medv), dtype = torch_float())
xv <- torch_tensor(x_vl, dtype = torch_float())
yv <- torch_tensor(as.matrix(val_df$medv), dtype = torch_float())
xe <- torch_tensor(x_te, dtype = torch_float())

# ---- 2. Model ----
net <- nn_sequential(
  nn_linear(13, 256), nn_batch_norm1d(256), nn_relu(), nn_dropout(0.3),
  nn_linear(256, 128), nn_batch_norm1d(128), nn_relu(), nn_dropout(0.3),
  nn_linear(128, 64), nn_relu(), nn_dropout(0.2),
  nn_linear(64, 1)
)

# ---- 3. Train ----
opt <- optim_adam(net$parameters, lr = 0.001, weight_decay = 1e-4)
crit <- nn_mse_loss()
best_val <- Inf
best_sd  <- NULL

for (ep in 1:200) {
  net$train()
  loss <- crit(net(xt), yt)
  opt$zero_grad(); loss$backward(); opt$step()
  net$eval()
  vl <- crit(net(xv), yv)$item()
  if (vl < best_val) {
    best_val <- vl
    # Checkpoint the best weights (mirrors the state_dict copy in the Python version)
    best_sd <- lapply(net$state_dict(), function(p) p$clone())
  }
  if (ep %% 50 == 0) cat(sprintf("Epoch %3d | Val: %.4f\n", ep, vl))
}

# ---- 4. Evaluate ----
net$load_state_dict(best_sd)  # restore the best checkpoint before testing
net$eval()
pred <- as.numeric(with_no_grad(net(xe)))
actual <- test_df$medv
r2 <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
rmse <- sqrt(mean((actual - pred)^2))
mae <- mean(abs(actual - pred))

cat("\n=== Final Test Evaluation ===\n")
cat("R-squared:", round(r2, 4), "\n")
cat("RMSE:     ", round(rmse, 4), "\n")
cat("MAE:      ", round(mae, 4), "\n")
Python Output
Early stopping at epoch 142

=== Final Evaluation on Test Set ===
R-squared:          0.8134
RMSE:               0.5198
MAE:                0.3612
Stata Output
Test R2: 0.7856
Test RMSE: 1932.45
R Output
Epoch  50 | Val: 14.2345
Epoch 100 | Val: 9.4567
Epoch 150 | Val: 8.1234
Epoch 200 | Val: 7.8901

=== Final Test Evaluation ===
R-squared: 0.8456
RMSE:      3.4521
MAE:       2.1234

When to Use Neural Networks in Economics

For the types of data most common in economics — structured, tabular datasets with tens to hundreds of features and thousands to hundreds of thousands of observations — neural networks are usually not the best choice. Tree-based methods (Random Forests, XGBoost) consistently match or outperform neural networks on tabular data, are easier to tune, and require less preprocessing. Large-scale benchmark studies (e.g., Grinsztajn, Oyallon, and Varoquaux, 2022) have repeatedly confirmed this finding. If your data fits in a spreadsheet, start with XGBoost.
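That advice is cheap to follow in practice. As a minimal sketch of what a tabular baseline comparison looks like — using synthetic nonlinear data and scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost, so every name and number here is illustrative rather than part of this chapter's pipeline:

```python
# Illustrative tabular baseline: gradient-boosted trees vs. a linear model
# on synthetic data with a purely nonlinear signal (not the chapter's dataset).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
# Ground truth uses a squared term and an interaction a linear model cannot represent
y = X[:, 0] ** 2 + 2 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lin = LinearRegression().fit(X_tr, y_tr)
gbt = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

print(f"Linear R2:        {lin.score(X_te, y_te):.3f}")  # near zero: no linear signal
print(f"Boosted trees R2: {gbt.score(X_te, y_te):.3f}")  # trees capture the nonlinearity
```

A baseline like this takes seconds to fit and tells you how much headroom, if any, a neural network would have to earn on your tabular problem.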

Where neural networks truly shine is on unstructured data: text (natural language processing), images (computer vision), audio, and other high-dimensional, spatially or sequentially structured inputs. If your economics research involves analyzing text from central bank communications, classifying satellite images to measure economic activity, or processing survey audio recordings, neural networks are the right tool. They are also valuable when you have very large datasets (millions of observations) where the additional model capacity pays off, and when you need to learn representations that transfer across tasks (transfer learning).

A practical concern for policy-oriented economics research is interpretability. Regulators, policymakers, and journal referees often want to understand why a model makes certain predictions. Neural networks are inherently opaque: model-agnostic tools such as SHAP values do work on them, but they are slower to compute and yield less transparent explanations than a linear model or even a decision tree provides. If your goal is policy-relevant prediction (for example, predicting recidivism to inform bail decisions), the interpretability costs of neural networks may outweigh their marginal predictive gains over simpler models.
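One lightweight, model-agnostic alternative to SHAP is permutation importance: shuffle one feature at a time and measure how much prediction error rises. It gives only a coarse, approximate ranking, but it works with any fitted model. A minimal sketch — the `permutation_importance` helper and the toy prediction function below are illustrative, not part of this chapter's pipeline:

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average rise in MSE when each column is shuffled; higher = more important."""
    rng = np.random.default_rng(seed)
    base = np.mean((y - predict(X)) ** 2)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break the link between feature j and the outcome
            imp[j] += np.mean((y - predict(Xp)) ** 2) - base
    return imp / n_repeats

# Toy check: feature 0 drives the outcome, feature 1 is pure noise
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)
imp = permutation_importance(lambda X: 3 * X[:, 0], X, y)
print(imp)  # importance of feature 0 dwarfs feature 1
```

A trained PyTorch network plugs in the same way, e.g. `predict = lambda X: model(torch.tensor(X, dtype=torch.float32)).detach().numpy().ravel()`.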

When to Use Neural Networks vs. Other Methods

  • Use neural networks for: text data (NLP), image data (computer vision), very large datasets (millions of rows), tasks where transfer learning is beneficial, or when you need to learn complex representations from raw data.
  • Use tree-based methods (XGBoost, Random Forest) for: standard tabular data, moderate-size datasets, when you need variable importance, or when ease of tuning matters.
  • Use regularized regression (LASSO, Ridge) for: inference-adjacent prediction tasks, small samples, when interpretability is paramount, or when the true relationship is approximately linear.

Limitations

Key Limitations of Neural Networks

  • Data hungry: Neural networks require large amounts of data to train well. With small samples (a few hundred or even a few thousand observations, typical for many economic panel datasets), neural networks are prone to overfitting and usually underperform simpler models like LASSO or Random Forests.
  • Computationally expensive: Training a neural network involves iterating over the data many times (epochs), computing gradients for thousands or millions of parameters. This is much slower than fitting an OLS regression or even training an XGBoost model. GPU hardware can help, but adds cost and complexity.
  • Hard to interpret: Neural networks are black boxes. While SHAP values, attention weights, and other interpretation techniques exist, they provide approximations rather than the clean, transparent explanations that a linear model or single decision tree offers. This is a serious limitation for policy applications.
  • Sensitive to hyperparameters: The number of layers, number of neurons, learning rate, dropout rate, batch size, and other settings all affect performance significantly. There is no reliable theory to guide these choices — you must rely on cross-validation and experimentation, which is time-consuming.
  • Feature scaling is required: Unlike tree-based methods, neural networks require that features be standardized or normalized. Forgetting this step can cause training to fail silently — the model may appear to train but produce poor results.
  • Overfitting on small data: A network with 10,000 parameters trained on 500 observations has vastly more capacity than the data can discipline: it can memorize the training set perfectly and generalize terribly. Regularization helps, but cannot fully compensate for insufficient data.
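That last point is easy to make concrete with the architecture from this section's R example (13 → 256 → 128 → 64 → 1, with batch norm after the first two hidden layers). Against roughly 354 training rows (a 70% split of the 506-row Boston sample), the parameter-to-observation ratio is over 100:1. A quick count, in plain Python with no torch required:

```python
# Count trainable parameters in the 13 -> 256 -> 128 -> 64 -> 1 network above.
sizes = [13, 256, 128, 64, 1]

# Each linear layer: (n_in * n_out) weights + n_out biases
linear = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# Each batch-norm layer: one gamma and one beta per unit
batchnorm = sum(2 * n for n in (256, 128))

total = linear + batchnorm
print(f"Linear params:    {linear:,}")     # 44,801
print(f"BatchNorm params: {batchnorm:,}")  # 768
print(f"Total:            {total:,}")      # 45,569 -- vs. ~354 training rows
```

In PyTorch the same number comes from `sum(p.numel() for p in model.parameters())` (batch-norm running statistics are buffers, not parameters, so they are excluded either way).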

References

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. (Available free at deeplearningbook.org)
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
  • Mullainathan, S., & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2), 87-106.
  • Ng, A. Stanford CS229: Machine Learning. Lecture Notes on Neural Networks.