8 Replicability & Reproducibility
Learning Objectives
- Understand the difference between replicability and reproducibility
- Organize research projects with clear folder structures
- Write documentation that enables others to run your code
- Create publication-ready replication packages
- Successfully replicate published economics papers
8.1 Definitions
| Concept | Definition | Question |
|---|---|---|
| Reproducibility | Same data + same code = same results | Can I re-run this and get identical output? |
| Replicability | New data + same methods = consistent findings | Does this hold in a different sample? |
8.2 Project Organization
A well-organized project folder is the foundation of reproducible research. Here's what a complete research project should look like:
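A sketch of one common layout (folder and file names mirror the examples used later in this module, but are illustrative, not prescriptive):

```
minimum-wage-study/
├── data/
│   ├── raw/              # Original data -- never modified
│   └── processed/        # Cleaned data created by code
├── code/
│   ├── 00_master.do      # Master script: runs everything
│   ├── 01_import_clean.do
│   ├── 02_merge_construct.do
│   └── ...
├── output/
│   ├── tables/
│   ├── figures/
│   └── logs/
├── paper/
│   └── main.tex
└── README.md
```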
Key Principles
- Raw data is sacred: Never modify original files—all cleaning happens in code
- One master script: Runs the entire analysis end-to-end with a single command
- Relative paths only: Never use absolute paths like C:\Users\...
- Number your scripts: Makes execution order obvious (00, 01, 02...)
- Version everything: Use Git (see Module 9)
- Separate outputs: Tables, figures, and logs in their own folders
8.3 Understanding File Paths
File paths tell your computer where to find files. Understanding them is essential for writing code that works on any machine.
Absolute vs. Relative Paths
| Type | Example | Verdict |
|---|---|---|
| Absolute Path | C:\Users\Maria\Projects\thesis\data\raw\survey.dta | ❌ Never use in code |
| Relative Path | data/raw/survey.dta | ✓ Always use this |
An absolute path specifies the complete location from the root of the file system — it starts with / on Mac/Linux (e.g., /Users/giulia/Projects/thesis/) or a drive letter on Windows (e.g., C:\Users\...). A relative path starts from your current location (the "working directory") and contains no root prefix — think of it like giving directions from where you already are: "go to the data folder, then the raw subfolder, then find survey.dta."
The slash direction (/ vs. \) does not determine whether a path is absolute or relative. Both /Users/giulia/data.csv (forward slash) and C:\Users\Giulia\data.csv (backslash) are absolute paths; both data/raw/survey.csv and data\raw\survey.csv are relative paths. What matters is whether the path starts from the root of the file system or from the current working directory — not which slash separates the folders.
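You can verify this distinction programmatically. A minimal sketch in Python's standard `pathlib` module (the paths are illustrative):

```python
from pathlib import PurePosixPath, PureWindowsPath

# The slash style does not decide absolute vs. relative --
# what matters is whether the path starts from the filesystem root.
print(PurePosixPath("/Users/giulia/data.csv").is_absolute())       # True
print(PurePosixPath("data/raw/survey.csv").is_absolute())          # False
print(PureWindowsPath(r"C:\Users\Giulia\data.csv").is_absolute())  # True
print(PureWindowsPath(r"data\raw\survey.csv").is_absolute())       # False
```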
Path Syntax: Mac/Linux vs. Windows
# Mac/Linux use forward slashes /
/Users/giulia/Projects/thesis/ # Absolute path
data/raw/survey.csv # Relative path
# Navigate up one folder with ..
../other_project/data.csv # Go up, then into other_project
# Home directory shortcut
~/Projects/thesis/ # ~ means your home folder
# Windows traditionally uses backslashes \
C:\Users\Giulia\Projects\thesis\ # Absolute path
data\raw\survey.csv # Relative path
# But forward slashes work in most modern software!
data/raw/survey.csv # This works in Stata, Python, R
# Navigate up one folder with ..
..\other_project\data.csv # Go up, then into other_project
Tip: prefer forward slashes (/) everywhere, even on Windows. Stata, Python, and R all understand them, making your code cross-platform compatible.
Setting the Working Directory
The working directory is your "home base"—all relative paths start from here. Set it once at the beginning of your master script.
* Check current working directory
pwd
* Set working directory (do this ONCE in master script)
cd "/Users/giulia/Projects/minimum-wage-study"
* Now all paths are relative to this location
use "data/raw/cps_2019.dta", clear
export delimited "data/processed/clean_data.csv"
# Check current working directory
getwd()
# Set working directory
setwd("/Users/giulia/Projects/minimum-wage-study")
# Better: use here package (finds project root automatically)
library(here)
data <- read_csv(here("data", "raw", "cps_2019.csv"))
# Or use RStudio Projects (.Rproj files)
import os
from pathlib import Path
# Check current working directory
print(os.getcwd())
# Set working directory
os.chdir("/Users/giulia/Projects/minimum-wage-study")
# Better: use pathlib for cross-platform paths
project_root = Path(__file__).parent.parent # Go up from code/ folder
data_path = project_root / "data" / "raw" / "cps_2019.csv"
8.4 The Master Script
The master script is the single entry point that reproduces your entire analysis. A reviewer or co-author should be able to run this one file and regenerate all results.
A good master script includes:
- Clear header with project info and instructions
- One location to set the root path (the only absolute path in the entire project)
- Global macros for all folder paths
- Sequential execution of all sub-scripts
- Timestamps and logging
/*==============================================================================
MASTER DO-FILE: Minimum Wage and Employment
Paper: "The Employment Effects of Minimum Wage Increases"
Authors: Smith, J. and Jones, M.
Version: 1.0
Date: January 2025
INSTRUCTIONS:
1. Set the root path below to your project folder location
2. Ensure all required packages are installed (see below)
3. Run this entire file to reproduce all results
REQUIRED PACKAGES: estout, reghdfe, ftools, coefplot
Install with: ssc install estout, replace
================================================================================*/
clear all
set more off
cap log close
*------------------------------------------------------------------------------*
* SET ROOT PATH - MODIFY THIS LINE ONLY
*------------------------------------------------------------------------------*
* Users should change this path to their own project folder location
if "`c(username)'" == "giulia" {
    global root "/Users/giulia/Projects/minimum-wage-study"
}
else if "`c(username)'" == "john" {
    global root "C:/Users/john/Research/minimum-wage-study"
}
else {
    * Default: assume current directory is project root
    global root "`c(pwd)'"
}
*------------------------------------------------------------------------------*
* DEFINE FOLDER PATHS (do not modify)
*------------------------------------------------------------------------------*
global data "$root/data"
global raw "$data/raw"
global processed "$data/processed"
global code "$root/code"
global output "$root/output"
global tables "$output/tables"
global figures "$output/figures"
global logs "$output/logs"
*------------------------------------------------------------------------------*
* CREATE FOLDERS IF THEY DON'T EXIST
*------------------------------------------------------------------------------*
cap mkdir "$processed"
cap mkdir "$output"
cap mkdir "$tables"
cap mkdir "$figures"
cap mkdir "$logs"
*------------------------------------------------------------------------------*
* START LOG FILE
*------------------------------------------------------------------------------*
local datetime: di %tcCCYY-NN-DD-HH-MM-SS clock("`c(current_date)' `c(current_time)'","DMYhms")
log using "$logs/master_log_`datetime'.txt", text replace
di "=============================================="
di " MASTER SCRIPT STARTED"
di " Date/Time: `c(current_date)' `c(current_time)'"
di " Stata version: `c(stata_version)'"
di " Root directory: $root"
di "=============================================="
*------------------------------------------------------------------------------*
* RUN ANALYSIS SCRIPTS IN ORDER
*------------------------------------------------------------------------------*
di _n ">>> Running 01_import_clean.do..."
do "$code/01_import_clean.do"
di _n ">>> Running 02_merge_construct.do..."
do "$code/02_merge_construct.do"
di _n ">>> Running 03_descriptive_stats.do..."
do "$code/03_descriptive_stats.do"
di _n ">>> Running 04_main_analysis.do..."
do "$code/04_main_analysis.do"
di _n ">>> Running 05_robustness.do..."
do "$code/05_robustness.do"
*------------------------------------------------------------------------------*
* COMPLETION
*------------------------------------------------------------------------------*
di _n "=============================================="
di " MASTER SCRIPT COMPLETED SUCCESSFULLY"
di " End time: `c(current_date)' `c(current_time)'"
di " All outputs saved to: $output"
di "=============================================="
log close
#===============================================================================
# MASTER SCRIPT: Minimum Wage and Employment
#
# Paper: "The Employment Effects of Minimum Wage Increases"
# Authors: Smith, J. and Jones, M.
# Version: 1.0
# Date: January 2025
#
# INSTRUCTIONS:
# 1. Open the .Rproj file in RStudio (sets working directory automatically)
# 2. Run this entire script to reproduce all results
#
# REQUIRED PACKAGES: tidyverse, fixest, modelsummary, here
#===============================================================================
# Clear environment
rm(list = ls())
# Load packages (install if needed)
required_packages <- c("tidyverse", "fixest", "modelsummary", "here", "haven")
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
library(tidyverse)
library(fixest)
library(modelsummary)
library(here)
library(haven)
#------------------------------------------------------------------------------
# DEFINE PATHS (using here package - no manual path setting needed!)
#------------------------------------------------------------------------------
paths <- list(
  raw       = here("data", "raw"),
  processed = here("data", "processed"),
  tables    = here("output", "tables"),
  figures   = here("output", "figures"),
  logs      = here("output", "logs")
)
# Create directories if they don't exist
walk(paths, ~dir.create(.x, recursive = TRUE, showWarnings = FALSE))
#------------------------------------------------------------------------------
# START LOG
#------------------------------------------------------------------------------
sink(file.path(paths$logs, paste0("master_log_", Sys.Date(), ".txt")))
cat("==============================================\n")
cat(" MASTER SCRIPT STARTED\n")
cat(" Date/Time:", format(Sys.time()), "\n")
cat(" R version:", R.version.string, "\n")
cat(" Project root:", here(), "\n")
cat("==============================================\n\n")
#------------------------------------------------------------------------------
# RUN ANALYSIS SCRIPTS IN ORDER
#------------------------------------------------------------------------------
cat("\n>>> Running 01_import_clean.R...\n")
source(here("code", "01_import_clean.R"))
cat("\n>>> Running 02_merge_construct.R...\n")
source(here("code", "02_merge_construct.R"))
cat("\n>>> Running 03_descriptive_stats.R...\n")
source(here("code", "03_descriptive_stats.R"))
cat("\n>>> Running 04_main_analysis.R...\n")
source(here("code", "04_main_analysis.R"))
cat("\n>>> Running 05_robustness.R...\n")
source(here("code", "05_robustness.R"))
#------------------------------------------------------------------------------
# COMPLETION
#------------------------------------------------------------------------------
cat("\n==============================================\n")
cat(" MASTER SCRIPT COMPLETED SUCCESSFULLY\n")
cat(" End time:", format(Sys.time()), "\n")
cat(" All outputs saved to:", here("output"), "\n")
cat("==============================================\n")
sink()
"""
================================================================================
MASTER SCRIPT: Minimum Wage and Employment
Paper: "The Employment Effects of Minimum Wage Increases"
Authors: Smith, J. and Jones, M.
Version: 1.0
Date: January 2025
INSTRUCTIONS:
1. Navigate to project folder in terminal
2. Run: python code/00_master.py
REQUIRED PACKAGES: pandas, numpy, statsmodels, matplotlib, linearmodels
Install with: pip install -r requirements.txt
================================================================================
"""
import os
import sys
from pathlib import Path
from datetime import datetime
import subprocess
#------------------------------------------------------------------------------
# DEFINE PATHS
#------------------------------------------------------------------------------
# Get project root (parent of code folder)
ROOT = Path(__file__).resolve().parent.parent
PATHS = {
    'raw': ROOT / 'data' / 'raw',
    'processed': ROOT / 'data' / 'processed',
    'tables': ROOT / 'output' / 'tables',
    'figures': ROOT / 'output' / 'figures',
    'logs': ROOT / 'output' / 'logs',
    'code': ROOT / 'code'
}
# Create directories if they don't exist
for path in PATHS.values():
    path.mkdir(parents=True, exist_ok=True)
#------------------------------------------------------------------------------
# LOGGING SETUP
#------------------------------------------------------------------------------
log_file = PATHS['logs'] / f"master_log_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}.txt"
class Logger:
    def __init__(self, filename):
        self.terminal = sys.stdout
        self.log = open(filename, 'w')
    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)
    def flush(self):
        self.terminal.flush()
        self.log.flush()

sys.stdout = Logger(log_file)
#------------------------------------------------------------------------------
# START
#------------------------------------------------------------------------------
print("=" * 50)
print(" MASTER SCRIPT STARTED")
print(f" Date/Time: {datetime.now()}")
print(f" Python version: {sys.version}")
print(f" Project root: {ROOT}")
print("=" * 50)
#------------------------------------------------------------------------------
# RUN ANALYSIS SCRIPTS IN ORDER
#------------------------------------------------------------------------------
scripts = [
    '01_import_clean.py',
    '02_merge_construct.py',
    '03_descriptive_stats.py',
    '04_main_analysis.py',
    '05_robustness.py'
]
for script in scripts:
    print(f"\n>>> Running {script}...")
    script_path = PATHS['code'] / script
    # Run each script in its own process; check=True aborts the pipeline on error.
    # Capturing output routes the child's printout through the Logger above.
    result = subprocess.run([sys.executable, str(script_path)],
                            capture_output=True, text=True, check=True)
    print(result.stdout)
#------------------------------------------------------------------------------
# COMPLETION
#------------------------------------------------------------------------------
print("\n" + "=" * 50)
print(" MASTER SCRIPT COMPLETED SUCCESSFULLY")
print(f" End time: {datetime.now()}")
print(f" All outputs saved to: {ROOT / 'output'}")
print("=" * 50)
8.5 Documentation: The README
Every project needs a README answering: What is this? What data? What software? How to run?
# Project Title
## Overview
Brief description of the research question and findings.
## Data
- **Source**: Where data comes from (e.g., "CPS via IPUMS")
- **Access**: How to obtain it (include URLs or instructions)
- **Files**: List of data files and what they contain
## Software Requirements
- Stata 17 or later
- Required packages: estout, reghdfe, ftools
## Instructions
1. Clone or download this repository
2. Open `code/00_master.do`
3. Set the root path on line 25 to your project folder
4. Run the master file
## File Structure
[Include folder tree here]
## Output
Running the master script will generate:
- Tables 1-5 in `output/tables/`
- Figures 1-3 in `output/figures/`
## Contact
Your Name (email@university.edu)
8.6 Automated LaTeX Tables
Never copy-paste results into tables manually. Export publication-ready LaTeX tables directly from your statistical software.
The manual workflow (avoid):
- Run regression in Stata
- Copy numbers from output window
- Paste into Excel
- Format in Excel
- Copy to Word or manually type into LaTeX
Problems: Error-prone, time-consuming, not reproducible, nightmare when results change
The automated workflow:
- Run regression in Stata
- Export to .tex file with one command
- LaTeX imports table automatically
- Results change? Just re-run the code!
Stata: esttab / estout
The estout package is the gold standard for exporting Stata results to LaTeX.
* Install the package (once)
ssc install estout, replace
* Run your regressions and store estimates
reg earnings education age, robust
estimates store m1
reg earnings education age experience, robust
estimates store m2
reg earnings education age experience i.industry, robust
estimates store m3
* Export to LaTeX
esttab m1 m2 m3 using "$tables/main_results.tex", ///
replace ///
label /// Use variable labels
b(3) se(3) /// 3 decimal places
star(* 0.10 ** 0.05 *** 0.01) /// Significance stars
title("Effect of Education on Earnings") ///
mtitles("Basic" "Controls" "Industry FE") /// Column titles
scalars("r2 R-squared" "N Observations") ///
nonotes ///
addnotes("Standard errors in parentheses." ///
"* p<0.10, ** p<0.05, *** p<0.01")
This produces a complete LaTeX table that looks like:
| | (1) Basic | (2) Controls | (3) Industry FE |
|---|---|---|---|
| Education (years) | 0.089*** | 0.074*** | 0.068*** |
| | (0.003) | (0.003) | (0.004) |
| Age | 0.012*** | 0.008*** | 0.007*** |
| | (0.001) | (0.001) | (0.001) |
| R-squared | 0.142 | 0.187 | 0.234 |
| Observations | 45,231 | 45,231 | 45,231 |
Standard errors in parentheses. * p<0.10, ** p<0.05, *** p<0.01
R: modelsummary
library(modelsummary)
library(fixest)
# Run regressions
m1 <- feols(earnings ~ education + age, data = df)
m2 <- feols(earnings ~ education + age + experience, data = df)
m3 <- feols(earnings ~ education + age + experience | industry, data = df)
# Export to LaTeX
modelsummary(
  list("Basic" = m1, "Controls" = m2, "Industry FE" = m3),
  output = here("output", "tables", "main_results.tex"),
  stars = c('*' = 0.10, '**' = 0.05, '***' = 0.01),
  coef_rename = c("education" = "Education (years)", "age" = "Age"),
  gof_omit = "AIC|BIC|Log",
  title = "Effect of Education on Earnings",
  notes = "Standard errors in parentheses."
)
modelsummary produces the same table structure as the Stata example above. Here is what the rendered output looks like:
| | (1) Basic | (2) Controls | (3) Industry FE |
|---|---|---|---|
| Education (years) | 0.089*** | 0.074*** | 0.068*** |
| | (0.003) | (0.003) | (0.004) |
| Age | 0.012*** | 0.008*** | 0.007*** |
| | (0.001) | (0.001) | (0.001) |
| R-squared | 0.142 | 0.187 | 0.234 |
| Observations | 45,231 | 45,231 | 45,231 |
Standard errors in parentheses. * p<0.10, ** p<0.05, *** p<0.01
Python: stargazer
import pandas as pd
import statsmodels.api as sm
from stargazer.stargazer import Stargazer
# Run regressions
X1 = sm.add_constant(df[['education', 'age']])
m1 = sm.OLS(df['earnings'], X1).fit(cov_type='HC1')
X2 = sm.add_constant(df[['education', 'age', 'experience']])
m2 = sm.OLS(df['earnings'], X2).fit(cov_type='HC1')
# Third model: add industry fixed effects via dummies
industry_dummies = pd.get_dummies(df['industry'], drop_first=True, dtype=int)
X3 = sm.add_constant(pd.concat([df[['education', 'age', 'experience']], industry_dummies], axis=1))
m3 = sm.OLS(df['earnings'], X3).fit(cov_type='HC1')
# Export to LaTeX
stargazer = Stargazer([m1, m2, m3])
stargazer.title("Effect of Education on Earnings")
stargazer.custom_columns(["Basic", "Controls", "Industry FE"])
with open(PATHS['tables'] / 'main_results.tex', 'w') as f:
    f.write(stargazer.render_latex())
Python's stargazer produces a similar regression table. Here is what the rendered output looks like:
| | (1) Basic | (2) Controls | (3) Industry FE |
|---|---|---|---|
| Education (years) | 0.089*** | 0.074*** | 0.068*** |
| | (0.003) | (0.003) | (0.004) |
| Age | 0.012*** | 0.008*** | 0.007*** |
| | (0.001) | (0.001) | (0.001) |
| R-squared | 0.142 | 0.187 | 0.234 |
| Observations | 45,231 | 45,231 | 45,231 |
Standard errors in parentheses. * p<0.10, ** p<0.05, *** p<0.01
Including Tables in LaTeX
In your paper's .tex file, simply input the generated table:
\begin{table}[htbp]
\centering
\input{../output/tables/main_results.tex}
\end{table}
What If You Use Word Instead of LaTeX?
Many students write their theses in Word rather than LaTeX, and that is perfectly fine. The important principle is not which word processor you use, but that you never copy-paste numbers manually from your statistical output into your document. All three languages can export tables directly to Word-friendly formats.
# modelsummary can export directly to a Word document
modelsummary(
  list("Basic" = m1, "Controls" = m2, "Industry FE" = m3),
  output = here("output", "tables", "main_results.docx"),
  stars = c('*' = 0.10, '**' = 0.05, '***' = 0.01)
)
* esttab can export to RTF format, which Word opens natively
esttab m1 m2 m3 using "$tables/main_results.rtf", ///
replace label b(3) se(3) ///
star(* 0.10 ** 0.05 *** 0.01) ///
mtitles("Basic" "Controls" "Industry FE")
# Option 1: Save as HTML, then open in Word
with open(PATHS['tables'] / 'main_results.html', 'w') as f:
    f.write(stargazer.render_html())
# Option 2: For simpler tables, export via pandas
summary_df.to_excel(PATHS['tables'] / 'summary_stats.xlsx', index=False)
These Word-friendly formats do not update automatically inside your document the way LaTeX's \input{} does. However, the workflow is still far better than copy-pasting: whenever your results change, simply re-run your script to regenerate the Word/RTF/HTML file, then replace the old table in your document. This takes seconds and eliminates transcription errors.
8.7 One-Click Pipeline: Full Integration
The ultimate goal is a workflow where one command reproduces your entire paper—data cleaning, analysis, tables, figures, and the compiled PDF.
The Dream Workflow
One command. Fully reproducible. No manual intervention.
Option 1: Makefile (Most Powerful)
A Makefile is a special file that tells your computer how to build your project, step by step. Think of a Makefile as a recipe book that knows which ingredients are already prepared. If you have already cleaned your data and nothing has changed since then, make will skip that step and jump straight to the analysis. This saves enormous amounts of time on large projects where a full pipeline might take hours to run.
Researchers use Makefiles because they capture the dependency structure of a project. Each "rule" in the Makefile says: "To produce this output file, I need these input files, and here is the command to run." When you type make all in your terminal, the tool inspects which output files are missing or older than their inputs and re-runs only those steps. Everything else is left untouched.
Makefiles are not written in any programming language like Python or Stata -- they use their own simple declarative syntax understood by the make utility. Each rule declares an output, its inputs, and a shell command to run. The commands inside each rule (like stata-mp -b do ...) are regular shell/bash commands -- the Makefile just orchestrates when they execute. You type make all in your terminal, from the project's root directory (wherever the Makefile is saved).
On Mac/Linux, make is usually pre-installed. Windows users can get it through Git Bash, WSL (Windows Subsystem for Linux), or by installing GNU Make via Chocolatey.
As you read the code below, pay attention to the pattern on each rule: the output file appears before the colon, the input files appear after the colon, and the command is indented on the next line. Notice how paper/main.pdf depends on both the LaTeX source and the generated tables and figures -- so the paper will only recompile when something it uses has actually changed.
# Makefile for research project
# Run with: make all
# Note: recipe lines must be indented with a TAB character, not spaces

.PHONY: all clean paper

# Default target
all: paper

# Data processing (depends on raw data)
data/processed/clean_data.dta: data/raw/cps_2019.dta code/01_import_clean.do
	stata-mp -b do code/01_import_clean.do

# Analysis (depends on processed data)
output/tables/main_results.tex: data/processed/clean_data.dta code/04_main_analysis.do
	stata-mp -b do code/04_main_analysis.do

output/figures/figure1.pdf: data/processed/clean_data.dta code/04_main_analysis.do
	stata-mp -b do code/04_main_analysis.do

# Paper (depends on tables and figures)
paper/main.pdf: paper/main.tex output/tables/main_results.tex output/figures/figure1.pdf
	cd paper && pdflatex main.tex && bibtex main && pdflatex main.tex && pdflatex main.tex

paper: paper/main.pdf

# Clean all generated files
clean:
	rm -f data/processed/*.dta
	rm -f output/tables/*.tex
	rm -f output/figures/*.pdf
	rm -f paper/*.pdf paper/*.aux paper/*.log paper/*.bbl paper/*.blg
Option 2: Shell Script (Simpler)
If a Makefile is a smart recipe book, a shell script is a simple checklist: it runs every step from top to bottom, every time, regardless of what has changed. This makes shell scripts easier to write and understand, which is why they are a perfectly good choice for small projects or when your full pipeline runs in a few minutes anyway. You create a file (typically called run_all.sh), list your commands in order, and execute it with bash run_all.sh in your terminal, from the project's root directory (where the script is saved). The downside is that it will redo work even when nothing has changed, but for many student projects that trade-off is fine.
#!/bin/bash
# run_all.sh - Reproduce entire paper
echo "Starting full pipeline..."
echo "========================="
# Run Stata analysis
echo "Step 1: Running Stata analysis..."
stata-mp -b do code/00_master.do
# Run Python visualizations (if any)
echo "Step 2: Running Python scripts..."
python code/06_visualizations.py
# Compile LaTeX paper
echo "Step 3: Compiling paper..."
cd paper
pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex
echo "========================="
echo "Done! Paper saved to paper/main.pdf"
Option 3: Overleaf + GitHub Sync
Options 1 and 2 handle the analysis pipeline -- running your code and generating outputs (tables, figures). But if your team writes the paper on Overleaf (a cloud-based LaTeX editor), you also need a way to get those generated tables and figures into your Overleaf project. That is what this option addresses: connecting your local pipeline's outputs to Overleaf via GitHub.
How to connect Overleaf and GitHub
- Create a GitHub repository for your project (or use an existing one). Make sure your `output/` folder with generated tables and figures is committed to this repo.
- In Overleaf, open your project and go to Menu (top-left corner) → GitHub → Link to GitHub Repository.
- Paste your GitHub repo URL and authorize Overleaf to access it.
- Push and pull between Overleaf and GitHub: in Overleaf, use Menu → GitHub → Push/Pull to sync changes in either direction.
- Workflow: run your analysis locally (via `make all` or `bash run_all.sh`) → outputs saved to `output/` folder → `git push` to GitHub → in Overleaf, pull from GitHub → paper recompiles with new tables and figures.
After running your pipeline, push the updated outputs to GitHub so Overleaf can pull them in. For a complete guide to Git commands like git add, git commit, and git push, see Module 9.
# After running your analysis pipeline
git add output/tables/*.tex output/figures/*.pdf
git commit -m "Update results"
git push
# Then in Overleaf: Menu → GitHub → Pull from GitHub
- One command, full paper. The goal of a reproducible pipeline is that a single command (like `make all` or `bash run_all.sh`) rebuilds everything from raw data to compiled PDF.
- Makefiles are smart. They track dependencies between files and only re-run steps whose inputs have changed. Best for large or slow-running projects.
- Shell scripts are simple. They run every step from top to bottom every time. A good starting point for smaller projects where the full pipeline finishes quickly.
- Version control ties it together. Committing your outputs to Git (and optionally syncing with Overleaf) ensures that every version of your paper corresponds to a traceable set of results.
- No manual steps. If your workflow requires you to "remember" to copy a file, rename an output, or run things in a particular order by hand, it is not yet fully reproducible. Automate it.
8.8 LaTeX Paper Structure
Large papers should be split into sections for easier editing and collaboration. Use a master file that inputs each section.
Project Structure
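One such layout might look like this (a sketch; file names mirror the \input{} and \bibliography{} calls in the master file):

```
paper/
├── main.tex            # Master file
├── preamble.tex        # Packages and settings
├── bibliography.bib    # References
└── sections/
    ├── 01_introduction.tex
    ├── 02_literature.tex
    ├── 03_data.tex
    ├── 04_methodology.tex
    ├── 05_results.tex
    ├── 06_conclusion.tex
    └── appendix.tex
```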
Master File (main.tex)
Why split a LaTeX paper into multiple files instead of writing everything in one long document? For the same reason you split code into separate scripts: it keeps things manageable. Each section lives in its own .tex file, and the master file (main.tex) pulls them together using \input{} commands. The \input{} command simply pastes the contents of another file at that point -- think of it as an #include or import statement. This means you can work on the methodology section without scrolling past 20 pages of literature review. It also makes collaboration easier: two co-authors can edit different section files without creating merge conflicts.
Notice that the preamble (all the package imports and settings) is kept in a separate file called preamble.tex. This is good practice because preamble code rarely changes and would otherwise clutter the top of your main file. Keeping it separate lets you focus on the document structure when you open main.tex.
% main.tex - Master file for paper
\documentclass[12pt]{article}
% Load preamble (packages, settings, custom commands)
\input{preamble}
\title{The Employment Effects of Minimum Wage Increases}
\author{Jane Smith\thanks{University of Example. Email: jsmith@example.edu}}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
Your abstract here...
\end{abstract}
% Main content - each section in separate file
\input{sections/01_introduction}
\input{sections/02_literature}
\input{sections/03_data}
\input{sections/04_methodology}
\input{sections/05_results}
\input{sections/06_conclusion}
% Bibliography
\bibliographystyle{aer}
\bibliography{bibliography}
% Appendix
\appendix
\input{sections/appendix}
\end{document}
Preamble File (preamble.tex)
% preamble.tex - Packages and settings
% Essential packages
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs} % Professional tables
\usepackage{natbib} % Citations
\usepackage{hyperref} % Clickable links
\usepackage{setspace} % Line spacing
\usepackage[margin=1in]{geometry}
% For including Stata/R output tables
\usepackage{threeparttable} % Table notes
\usepackage{tabularx} % Flexible columns
\usepackage{siunitx} % Number alignment
% Path to figures and tables (relative to main.tex)
\graphicspath{{../output/figures/}{figures/}}
% Custom commands for convenience
\newcommand{\tabpath}{../output/tables/}
% Double spacing for submission
\doublespacing
Including Generated Tables
% In sections/05_results.tex
\section{Results}
Table \ref{tab:main} presents our main results...
\begin{table}[htbp]
\centering
\caption{Effect of Minimum Wage on Employment}
\label{tab:main}
\input{\tabpath main_results.tex}
\end{table}
The coefficient on minimum wage in column (3) suggests that...
Compiling the Paper
A common surprise for beginners: compiling a LaTeX paper with a bibliography requires running pdflatex multiple times, not just once. Here is why. On the first pass, pdflatex reads your document and notes where you cite references and cross-reference tables or figures, but it does not know the final page numbers or citation details yet -- it writes those questions to auxiliary files. Then bibtex reads those auxiliary files, looks up your .bib bibliography database, and generates the formatted reference list. The second pdflatex pass incorporates the bibliography and resolves most cross-references. A third pass is needed to finalize any remaining references (like "see Table 3 on page 12") that shifted when the bibliography was inserted. If you only run pdflatex once, you will see question marks (??) wherever a citation or cross-reference should appear. The shortcut latexmk -pdf handles all of this automatically by running as many passes as needed.
# From the paper/ directory
cd paper
# Full compilation with bibliography
pdflatex main.tex # First pass
bibtex main # Process citations
pdflatex main.tex # Resolve references
pdflatex main.tex # Final pass
# Or use latexmk (handles everything automatically)
latexmk -pdf main.tex
A dedicated LaTeX editor adds further conveniences:
- Auto-compilation on save
- PDF preview side-by-side
- Syntax highlighting and snippets
- Forward/inverse search (click in PDF → go to code)
8.9 Replicating Published Papers
Where to find replication materials:
- AEA Data Repository: openicpsr.org
- Harvard Dataverse: dataverse.harvard.edu
- Author websites and GitHub
Steps to replicate:
- Find and download replication package
- Read the paper and README carefully
- Install required software/packages
- Run the master script
- Compare output to published tables
Gentzkow, M. & Shapiro, J. (2014). "Code and Data for the Social Sciences"