5 Data Analysis & Visualization
Learning Objectives
- Organize research projects with clear script naming conventions
- Create and maintain README files for reproducibility
- Compute and interpret descriptive statistics
- Create publication-quality visualizations
- Run and interpret regression analyses
- Generate summary tables for papers
5.1 Project Organization
Before diving into analysis, establishing a clear project structure is essential. A well-organized project allows you (and others) to understand what each script does and the order in which to run them. This becomes critical as projects grow in complexity and when sharing code for replication.
Script Naming Conventions
Script names should reveal what they do and the order to run them. The most effective approach uses numbered prefixes combined with descriptive names:
##_action_description.ext
where ## is a two-digit number (01, 02, 03...) indicating execution order.
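As a quick sanity check, a short script can flag files that break the convention. This is a sketch; the regex and the file names are illustrative, not part of any standard tool:

```python
# Sketch: validate script names against the ##_action_description.ext pattern
import re

pattern = re.compile(r"^\d{2}_[a-z0-9]+(_[a-z0-9]+)*\.(py|do|R)$")

names = ["01_import_data.py", "02 Clean Data.py", "05_merge_datasets.R"]
for name in names:
    print(name, "->", "ok" if pattern.match(name) else "rename me")
```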
Example Project Structure
project/
├── data/
│ ├── raw/ # Original, untouched data
│ │ ├── census_2020.csv
│ │ └── survey_responses.xlsx
│ ├── processed/ # Cleaned, analysis-ready data
│ │ └── analysis_sample.csv
│ └── temp/ # Intermediate files
│
├── code/
│ ├── 01_import_data.py # Load raw data
│ ├── 02_clean_data.py # Handle missing values, outliers
│ ├── 03_merge_datasets.py # Combine data sources
│ ├── 04_create_variables.py # Generate analysis variables
│ ├── 05_descriptive_stats.py # Summary statistics, EDA
│ ├── 06_regression_analysis.py # Main estimation
│ ├── 07_robustness_checks.py # Sensitivity analyses
│ └── 08_create_figures.py # Publication plots
│
├── output/
│ ├── tables/
│ └── figures/
│
└── README.md # Project documentation
Naming Best Practices
| Practice | Good | Avoid |
|---|---|---|
| Use numbers for order | 01_import.py | import.py |
| Be descriptive | 03_clean_income_vars.do | 03_clean.do |
| Use underscores | 05_merge_datasets.R | 05 merge datasets.R |
| Lowercase | 02_data_prep.py | 02_Data_Prep.py |
| Skip numbers for flexibility | 01, 02, 05, 10 | 1, 2, 3, 4 |
Leave gaps in your numbering (01, 02, 05, 10...) so you can insert new scripts without renaming everything. If you later need a script between 02_clean.py and 05_merge.py, you can create 03_validate.py without disrupting the sequence.
Language-Specific Conventions
# Python: Typical project structure
# code/
# 01_import_data.py
# 02_clean_data.py
# 03_analysis.py
# utils/ # Helper functions
# __init__.py
# data_helpers.py
# config.py # Settings, paths
# In config.py:
from pathlib import Path
PROJECT_ROOT = Path(__file__).parent.parent
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
OUTPUT = PROJECT_ROOT / "output"
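Python has no built-in equivalent of Stata's master do-file, but a small runner script serves the same purpose. This is a sketch that assumes the numbered code/ layout shown above; the function name is illustrative:

```python
# Sketch: a Python "master script" that runs every numbered script in order
import subprocess
import sys
from pathlib import Path

def run_pipeline(code_dir):
    """Run all ##_*.py scripts in code_dir in numeric order; stop on the first failure."""
    scripts = sorted(Path(code_dir).glob("[0-9][0-9]_*.py"))
    for script in scripts:
        print(f"Running {script.name} ...")
        subprocess.run([sys.executable, str(script)], check=True)
    return [s.name for s in scripts]

# Usage from a hypothetical 00_main.py at the project root:
# run_pipeline("code")
```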
* Stata: Master do-file pattern
* Create a 00_master.do that runs everything
* ============================================
* 00_master.do - Run entire analysis pipeline
* ============================================
global PROJECT "/Users/researcher/my_project"
global CODE "$PROJECT/code"
global DATA "$PROJECT/data"
global OUTPUT "$PROJECT/output"
* Run scripts in order
do "$CODE/01_import_data.do"
do "$CODE/02_clean_data.do"
do "$CODE/03_merge_datasets.do"
do "$CODE/04_analysis.do"
do "$CODE/05_create_tables.do"
# R: Typical project structure
# Using here() package for portable paths
# code/
# 00_main.R # Master script
# 01_load_data.R
# 02_clean_data.R
# 03_analysis.R
# R/ # Reusable functions
# utils.R
# In 00_main.R:
library(here)
# Source all scripts in order
source(here("code", "01_load_data.R"))
source(here("code", "02_clean_data.R"))
source(here("code", "03_analysis.R"))
5.2 README Files
A README file is the front door to your project. It tells collaborators (including your future self) what the project does, how to run it, and what to expect. Every research project should have one.
This section introduces README basics. For comprehensive documentation practices, including data dictionaries, codebooks, and replication packages, see Module 8: Replicability.
Essential README Elements
At minimum, your README should answer these questions:
- What is this project? - Brief description of the research question
- How do I run it? - Step-by-step instructions to reproduce results
- What do I need? - Software requirements, data sources
- What does each file do? - Description of key scripts
README Template
# Project Title
Brief description of your research project and main findings.
## Data
- **Source:** Describe where data comes from
- **Access:** How to obtain the data (public/restricted)
- **Files:** List main data files
## Requirements
### Software
- Python 3.9+ with packages in `requirements.txt`
- Stata 17 (for some analyses)
- R 4.2+ with packages in `renv.lock`
### Installation
```bash
pip install -r requirements.txt
```
## How to Replicate
1. Clone this repository
2. Download data from [source] and place in `data/raw/`
3. Run scripts in order:
- `01_import_data.py` - Load and validate raw data
- `02_clean_data.py` - Handle missing values, create variables
- `03_analysis.py` - Main regression analysis
- `04_figures.py` - Generate all figures
Or run everything with:
```bash
python 00_main.py
```
## File Structure
```
├── code/ # Analysis scripts
├── data/ # Data files (not tracked in git)
├── output/ # Tables and figures
└── README.md # This file
```
## Authors
Your Name (your.email@university.edu)
## License
MIT License (or appropriate license)
README for Different Audiences
| Audience | Key Information |
|---|---|
| Replicator | Exact steps to reproduce results, software versions, data access |
| Collaborator | Code organization, where to add new analyses, coding conventions |
| Future You | What you were thinking, why certain decisions were made |
| Reviewer/Editor | What data is included, what is sensitive, how results map to paper |
5.3 Descriptive Statistics
Before computing statistics, let's load a dataset to work with throughout this module. We'll use the NLSW88 dataset — a widely used, publicly available extract from the National Longitudinal Survey of Young Women (1988) with labor market data for 2,246 women.
import pandas as pd
import numpy as np
# Load the NLSW88 dataset (National Longitudinal Survey of Young Women, 1988)
df = pd.read_stata('https://www.stata-press.com/data/r18/nlsw88.dta')
# Rename columns to match this module's examples
df = df.rename(columns={'wage': 'income', 'grade': 'education',
'ttl_exp': 'experience'})
print(f"Loaded {len(df)} observations and {len(df.columns)} variables")
print(df.head())
* Load the NLSW88 dataset (National Longitudinal Survey of Young Women, 1988)
webuse nlsw88, clear
* Rename variables to match this module's examples
rename wage income
rename grade education
rename ttl_exp experience
describe, short
# Load the NLSW88 dataset (National Longitudinal Survey of Young Women, 1988)
library(tidyverse)
df <- haven::read_dta("https://www.stata-press.com/data/r18/nlsw88.dta")
# Rename columns to match this module's examples
df <- df %>% rename(income = wage, education = grade,
experience = ttl_exp)
cat("Loaded", nrow(df), "observations and", ncol(df), "variables\n")
head(df)
# Python: Descriptive statistics
import pandas as pd
df.describe() # Summary stats
df['income'].mean() # Mean
df['income'].median() # Median
df['income'].std() # Standard deviation
df['income'].quantile([0.25, 0.75]) # Percentiles
# By group
df.groupby('education')['income'].mean()
* Stata: Descriptive statistics
summarize income
summarize income, detail
tabstat income, by(education) stat(mean sd n)
# R: Descriptive statistics
summary(df$income)
mean(df$income, na.rm = TRUE)
sd(df$income, na.rm = TRUE)
# By group with dplyr
df %>% group_by(education) %>% summarize(mean_inc = mean(income))
5.4 Data Visualization
Histograms
# Python: Histogram
import matplotlib.pyplot as plt
plt.hist(df['income'], bins=30, edgecolor='white')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.title('Distribution of Income')
plt.savefig('histogram.png')
* Stata: Histogram
histogram income, frequency ///
title("Distribution of Income")
graph export "histogram.png", replace
# R: Histogram with ggplot2
ggplot(df, aes(x = income)) +
geom_histogram(bins = 30, fill = "steelblue") +
labs(title = "Distribution of Income")
ggsave("histogram.png")
Scatter Plots
# Python: Scatter plot
import seaborn as sns
sns.scatterplot(data=df, x='experience', y='income', hue='collgrad')
plt.xlabel('Total work experience (years)')
plt.ylabel('Hourly wage')
plt.savefig('scatter.png')
* Stata: Scatter plot
twoway (scatter income experience)
# R: Scatter plot
ggplot(df, aes(x = experience, y = income, color = factor(collgrad))) +
  geom_point()
5.5 Correlation Analysis
# Python: Correlation
df[['income', 'education', 'age']].corr()
# With p-values (drop missing values first — pearsonr does not accept NaN)
from scipy import stats
sub = df[['income', 'education']].dropna()
r, p = stats.pearsonr(sub['income'], sub['education'])
* Stata: Correlation
correlate income education age
pwcorr income education age, sig star(.05)
# R: Correlation
cor(df[c("income", "education", "age")], use = "complete.obs")
cor.test(df$income, df$education)
5.6 Basic Regression Analysis
Regression analysis is the foundation of quantitative empirical research. This section covers how to run regressions and—critically—how to interpret what the output means. Understanding these basics is essential before moving to causal inference methods in Module 6.
Running a Basic Regression
# Python: OLS Regression with statsmodels
import statsmodels.formula.api as smf
# Simple regression: income on education
model = smf.ols('income ~ education', data=df).fit()
# Multiple regression: add controls
model = smf.ols('income ~ education + age + experience', data=df).fit()
# View results
print(model.summary())
* Stata: OLS Regression
* Simple regression
reg income education
* Multiple regression with controls
reg income education age experience
* Robust standard errors (heteroskedasticity-robust)
reg income education age experience, robust
* Clustered standard errors (here by industry; cluster on your sampling unit)
reg income education age experience, cluster(industry)
# R: OLS Regression
# Simple regression
model <- lm(income ~ education, data = df)
# Multiple regression with controls
model <- lm(income ~ education + age + experience, data = df)
# View results
summary(model)
# Robust standard errors (using sandwich package)
library(sandwich)
library(lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
Understanding Regression Output
A regression output contains many statistics. Here's what each one means and how to interpret it:
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      -0.523      1.423     -0.368      0.713      -3.314       2.267
education       0.624      0.047     13.283      0.000       0.532       0.717
age            -0.009      0.037     -0.230      0.818      -0.081       0.064
==============================================================================
R-squared: 0.146      F-statistic: 191.3      Prob (F-statistic): 2.45e-77
==============================================================================
Key Components Explained
| Component | What It Tells You | Example Interpretation |
|---|---|---|
| Coefficient (coef) | The estimated effect of a one-unit change in X on Y | One more year of education is associated with $0.62 higher hourly wage |
| Standard Error (std err) | Uncertainty in the coefficient estimate | Our estimate of education's effect could vary by about $0.05 |
| t-statistic (t) | Coefficient divided by standard error; tests if coef differs from 0 | 13.28 is very large, indicating strong evidence of an effect |
| P-value (P>|t|) | Probability of seeing this result if true effect were zero | 0.000 means extremely unlikely to see this by chance |
| Confidence Interval | Range constructed so that, across repeated samples, 95% of such intervals contain the true coefficient | The data are consistent with an education effect between $0.53 and $0.72 per hour |
| R-squared | Proportion of variance in Y explained by the model | Education and age explain 14.6% of wage variation |
| F-statistic | Tests whether all coefficients are jointly zero | A high F (191.3) with a tiny p-value means the regressors are jointly significant |
A regression coefficient measures association, not causation. The education coefficient tells us that higher education is associated with higher income—not that education causes higher income. People who get more education may differ in other ways (ability, motivation, family background) that also affect income. To make causal claims, you need the methods covered in Module 6: Causal Inference.
Statistical Significance
A coefficient is "statistically significant" when we can be confident it differs from zero. Common thresholds:
- p < 0.01 (***): Very strong evidence
- p < 0.05 (**): Strong evidence (most common threshold)
- p < 0.10 (*): Weak evidence
However, statistical significance tells you nothing about practical significance. A coefficient can be statistically significant but economically tiny—or statistically insignificant but potentially meaningful with more data.
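To make these quantities concrete, here is the 95% confidence interval and t-statistic computed by hand from the education row of the output above. This is a sketch using the normal approximation (critical value 1.96), which is close to the exact t critical value in large samples:

```python
# Sketch: t-statistic and 95% CI from a coefficient and its standard error
coef, se = 0.624, 0.047          # education row from the regression output above

t_stat = coef / se               # matches the printed t up to rounding
lo, hi = coef - 1.96 * se, coef + 1.96 * se

print(f"t = {t_stat:.2f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```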
Extracting OLS Components: SSR, ESS, TSS, R²
Understanding the decomposition of variance in regression is fundamental. The Total Sum of Squares (TSS) measures total variation in the outcome. OLS splits this into the Explained Sum of Squares (ESS)—variation captured by the model—and the Sum of Squared Residuals (SSR)—unexplained variation. The identity TSS = ESS + SSR always holds for OLS with an intercept.
# Python: Extracting SSR, ESS, TSS, R-squared
import statsmodels.formula.api as smf
import numpy as np
model = smf.ols('income ~ education + experience', data=df).fit()
# Direct access from model object
print(f"SSR (Residual SS): {model.ssr:.4f}")
print(f"ESS (Model SS): {model.ess:.4f}")
print(f"TSS (Total SS): {model.centered_tss:.4f}")
print(f"R-squared: {model.rsquared:.4f}")
print(f"Adj. R-squared: {model.rsquared_adj:.4f}")
# Manual calculation (for understanding)
y = df['income']
y_hat = model.fittedvalues
e = model.resid
SSR = np.sum(e**2)
TSS = np.sum((y - y.mean())**2)
ESS = np.sum((y_hat - y.mean())**2)
R2 = 1 - SSR / TSS
# Verify the identity: TSS = ESS + SSR
print(f"TSS = {TSS:.4f}, ESS + SSR = {ESS + SSR:.4f}")
# Adjusted R-squared
n = len(y)
k = model.df_model # number of regressors
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
* Stata: Extracting SSR, ESS, TSS, R-squared
reg income education experience
* All stored results accessible via ereturn list
ereturn list
* Key scalars:
display "SSR (Residual SS): " e(rss)
display "ESS (Model SS): " e(mss)
display "R-squared: " e(r2)
display "Adj R-squared: " e(r2_a)
display "N observations: " e(N)
display "Degrees of freedom: " e(df_r)
* TSS = ESS + SSR
display "TSS = " e(mss) + e(rss)
* Manual calculation from residuals
predict residuals, residuals
predict yhat, xb
quietly summarize income
gen dev_sq = (income - r(mean))^2
gen res_sq = residuals^2
gen fit_sq = (yhat - r(mean))^2
quietly summarize dev_sq
display "TSS = " r(sum)
quietly summarize res_sq
display "SSR = " r(sum)
# R: Extracting SSR, ESS, TSS, R-squared
model <- lm(income ~ education + experience, data = df)
# Extract key objects
e_hat <- residuals(model) # residual vector
y_hat <- fitted(model) # predicted values
y <- df$income # actual outcome
# Manual calculations
SSR <- sum(e_hat^2) # Sum of Squared Residuals
TSS <- sum((y - mean(y))^2) # Total Sum of Squares
ESS <- sum((y_hat - mean(y))^2) # Explained Sum of Squares
# Verify: TSS = ESS + SSR
cat("TSS =", TSS, " ESS + SSR =", ESS + SSR, "\n")
# R-squared and adjusted R-squared
R2 <- 1 - SSR / TSS
n <- nrow(df)
k <- length(coef(model)) - 1 # number of regressors (excl. intercept)
adj_R2 <- 1 - (1 - R2) * (n - 1) / (n - k - 1)
# Compare with built-in
cat("Manual R2:", R2, " summary R2:", summary(model)$r.squared, "\n")
cat("Manual adj-R2:", adj_R2, " summary adj-R2:", summary(model)$adj.r.squared, "\n")
# Alternative: deviance() gives SSR directly
deviance(model) # same as SSR
TSS = ESS + SSR (with an intercept). R² = ESS/TSS = 1 - SSR/TSS tells you the fraction of total variance explained by the model. Adjusted R² penalizes for adding regressors: it can decrease if a new variable adds more noise than explanatory power.
In practice, you rarely compute these manually—but understanding the decomposition helps you interpret regression output, compare models, and verify that your estimation code is correct.
F-Tests and Joint Significance
An F-test evaluates whether a set of coefficients are jointly significantly different from zero (or from each other). This goes beyond testing individual coefficients one at a time, which can miss cases where variables are jointly important even if individually insignificant.
# Python: F-tests and joint significance
import statsmodels.formula.api as smf
import numpy as np
model = smf.ols('income ~ education + experience + age', data=df).fit()
# Joint test: education AND experience = 0
f_result = model.f_test('education = 0, experience = 0')
print(f_result)
# Test a linear restriction: education = experience
f_restrict = model.f_test('education = experience')
print(f_restrict)
# Overall F-test (all regressors jointly = 0)
print(f"Overall F-statistic: {model.fvalue:.4f}")
print(f"Overall F p-value: {model.f_pvalue:.6f}")
* Stata: F-tests and joint significance
reg income education experience age
* Joint test: education AND experience = 0
test education experience
* Test a linear restriction: education = experience
test education = experience
* Multiple restrictions at once
test (education = 0) (experience = 0) (age = 0)
* Wald test (equivalent for linear models)
testparm education experience
# R: F-tests and joint significance
library(car)
model <- lm(income ~ education + experience + age, data = df)
# Joint test: education AND experience = 0
linearHypothesis(model, c("education = 0", "experience = 0"))
# Test a linear restriction: education = experience
linearHypothesis(model, "education = experience")
# With robust standard errors
library(sandwich)
linearHypothesis(model, c("education = 0", "experience = 0"),
vcov = vcovHC(model, type = "HC1"))
# Overall F-test (from summary)
f_stat <- summary(model)$fstatistic
pf(f_stat[1], f_stat[2], f_stat[3], lower.tail = FALSE)
Use F-tests when you want to test whether a group of variables matters jointly, test restrictions like "does education have the same effect as experience?", or compare nested models. With heteroskedasticity or clustering, always use the robust version (pass a vcov matrix in R, or use , robust in Stata).
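The F-statistic itself comes from comparing restricted and unrestricted residual sums of squares: F = [(SSR_r - SSR_u)/q] / [SSR_u/(n - k - 1)], where q is the number of restrictions. A small simulation (a sketch on made-up data, using only numpy) applies the formula directly:

```python
# Sketch: joint F-test by hand on simulated data (numpy only)
import numpy as np

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)       # x1 matters, x2 does not

def ssr(X):
    """Sum of squared residuals from OLS of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

X_u = np.column_stack([np.ones(n), x1, x2])   # unrestricted model
X_r = np.ones((n, 1))                         # restricted: coefs on x1, x2 both zero

q, k = 2, 2                                   # restrictions tested; regressors in full model
F = ((ssr(X_r) - ssr(X_u)) / q) / (ssr(X_u) / (n - k - 1))
print(f"F = {F:.2f}")  # large, because x1 truly belongs in the model
```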
Common Regression Specifications
# Python: Common specifications
import statsmodels.formula.api as smf
import numpy as np
# Log transformation
df['log_income'] = np.log(df['income'])
model = smf.ols('log_income ~ education + age', data=df).fit()
# Quadratic term (non-linear relationship)
model = smf.ols('income ~ age + I(age**2)', data=df).fit()
# Interaction terms (union = union membership indicator in NLSW88)
model = smf.ols('income ~ education * union', data=df).fit()
# Categorical variables (dummy encoding)
model = smf.ols('income ~ education + C(industry)', data=df).fit()
* Stata: Common specifications
* Log transformation
gen log_income = ln(income)
reg log_income education age
* Quadratic term
reg income c.age##c.age
* Interaction terms
reg income c.education##i.union
* Categorical variables
reg income education i.industry
* Fixed effects
areg income education age, absorb(industry)
# R: Common specifications
# Log transformation
model <- lm(log(income) ~ education + age, data = df)
# Quadratic term
model <- lm(income ~ age + I(age^2), data = df)
# Interaction terms
model <- lm(income ~ education * union, data = df)
# Categorical variables
model <- lm(income ~ education + factor(industry), data = df)
# Fixed effects (using fixest package)
library(fixest)
model <- feols(income ~ education + age | industry, data = df)
Interpreting Log Transformations
| Model | Form | Interpretation of b1 |
|---|---|---|
| Linear | Y = b0 + b1*X | 1 unit increase in X → b1 unit change in Y |
| Log-Linear | log(Y) = b0 + b1*X | 1 unit increase in X → approximately (b1 × 100)% change in Y |
| Linear-Log | Y = b0 + b1*log(X) | 1% increase in X → approximately b1/100 unit change in Y |
| Log-Log | log(Y) = b0 + b1*log(X) | 1% increase in X → approximately b1% change in Y (elasticity) |
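The "(b1 × 100)%" reading of a log-linear coefficient is an approximation that works well when b1 is small. The exact percent change in Y is 100 × (exp(b1) − 1), and the gap grows with the coefficient — a quick sketch:

```python
# Sketch: approximate vs exact percent change in a log-linear model
import numpy as np

for b1 in [0.02, 0.10, 0.50]:
    approx = 100 * b1                  # the "b1 x 100 percent" rule of thumb
    exact = 100 * (np.exp(b1) - 1)     # exact percent change in Y
    print(f"b1 = {b1:.2f}: approx {approx:.1f}%, exact {exact:.1f}%")
```

For b1 = 0.02 the two readings are nearly identical, but for b1 = 0.50 the exact change is about 65%, not 50%.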
Choosing Control Variables: Avoiding Post-Treatment Bias
A common regression mistake is controlling for variables that are themselves affected by the treatment. These are called post-treatment variables (or "bad controls"). Including them blocks part of the causal path and biases the estimate of the treatment effect.
Suppose you study the effect of a job training program (assigned in 2020) on wages in 2021. You should control for pre-treatment variables like age and education (measured before 2020). But you should not control for job title in 2021 — if training changes someone's job title, which then changes their wage, controlling for job title blocks the indirect effect and underestimates the true treatment impact.
Rule of thumb: Only control for variables that were determined before treatment was assigned. If a variable could have been affected by the treatment, leave it out. This applies to all regression-based methods: OLS, matching, DiD, and IV.
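A short simulation makes the bias visible (a sketch on simulated data, numpy only). The treatment raises a mediator by 2, and the mediator raises the outcome by 3, so the total treatment effect is 2 × 3 = 6. Regressing on treatment alone recovers it; adding the post-treatment mediator drives the treatment coefficient toward zero:

```python
# Sketch: post-treatment ("bad control") bias on simulated data
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
T = rng.binomial(1, 0.5, n).astype(float)   # randomly assigned treatment
M = 2.0 * T + rng.normal(size=n)            # mediator, affected by treatment
Y = 3.0 * M + rng.normal(size=n)            # outcome; total effect of T is 2 * 3 = 6

def ols(X):
    """OLS coefficients of Y on the columns of X."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

b_good = ols(np.column_stack([np.ones(n), T]))     # correct: T coefficient near 6
b_bad = ols(np.column_stack([np.ones(n), T, M]))   # bad control: T coefficient near 0

print(f"Without mediator: {b_good[1]:.2f}   With mediator: {b_bad[1]:.2f}")
```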
5.7 Summary Tables for Papers
# Python: Summary table with stargazer
# pip install stargazer
from stargazer.stargazer import Stargazer
# For regression tables
import statsmodels.formula.api as smf
model = smf.ols('income ~ education + age', data=df).fit()
stargazer = Stargazer([model])
print(stargazer.render_latex())
* Stata: Summary tables with estout
eststo clear
eststo: reg income education
eststo: reg income education age
esttab using "table.tex", replace se star(* 0.1 ** 0.05 *** 0.01)
# R: Summary tables with modelsummary
library(modelsummary)
model1 <- lm(income ~ education, data = df)
model2 <- lm(income ~ education + age, data = df)
modelsummary(list(model1, model2), stars = TRUE, output = "table.tex")
5.8 Exercises
Exercise 5.1: Descriptive Statistics and Visualization
Practice computing statistics and creating visualizations. Complete these tasks:
- Calculate descriptive statistics (mean, median, std)
- Create a histogram
- Create a scatter plot