5 Data Analysis & Visualization
Learning Objectives
- Organize research projects with clear script naming conventions
- Create and maintain README files for reproducibility
- Compute and interpret descriptive statistics
- Create publication-quality visualizations
- Run and interpret regression analyses
- Generate summary tables for papers
5.1 Project Organization
Before diving into analysis, establishing a clear project structure is essential. A well-organized project allows you (and others) to understand what each script does and the order in which to run them. This becomes critical as projects grow in complexity and when sharing code for replication.
Script Naming Conventions
Script names should reveal what they do and the order to run them. The most effective approach uses numbered prefixes combined with descriptive names:
##_action_description.ext
where ## is a two-digit number (01, 02, 03...) indicating execution order.
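As a quick sanity check, a short script can flag files that break the convention. This is a sketch; the regex and the file names are illustrative, not part of any standard tool:

```python
# Sketch: validate script names against the ##_action_description.ext pattern
import re

pattern = re.compile(r"^\d{2}_[a-z0-9]+(_[a-z0-9]+)*\.(py|do|R)$")

names = ["01_import_data.py", "02 Clean Data.py", "05_merge_datasets.R"]
for name in names:
    print(name, "->", "ok" if pattern.match(name) else "rename me")
```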
Example Project Structure
project/
├── data/
│ ├── raw/ # Original, untouched data
│ │ ├── census_2020.csv
│ │ └── survey_responses.xlsx
│ ├── processed/ # Cleaned, analysis-ready data
│ │ └── analysis_sample.csv
│ └── temp/ # Intermediate files
│
├── code/
│ ├── 01_import_data.py # Load raw data
│ ├── 02_clean_data.py # Handle missing values, outliers
│ ├── 03_merge_datasets.py # Combine data sources
│ ├── 04_create_variables.py # Generate analysis variables
│ ├── 05_descriptive_stats.py # Summary statistics, EDA
│ ├── 06_regression_analysis.py # Main estimation
│ ├── 07_robustness_checks.py # Sensitivity analyses
│ └── 08_create_figures.py # Publication plots
│
├── output/
│ ├── tables/
│ └── figures/
│
└── README.md # Project documentation
Naming Best Practices
| Practice | Good | Avoid |
|---|---|---|
| Use numbers for order | 01_import.py | import.py |
| Be descriptive | 03_clean_income_vars.do | 03_clean.do |
| Use underscores | 05_merge_datasets.R | 05 merge datasets.R |
| Lowercase | 02_data_prep.py | 02_Data_Prep.py |
| Skip numbers for flexibility | 01, 02, 05, 10 | 1, 2, 3, 4 |
Leave gaps in your numbering (01, 02, 05, 10...) so you can insert new scripts without renaming everything. If you later need a script between 02_clean.py and 05_merge.py, you can create 03_validate.py without disrupting the sequence.
Language-Specific Conventions
# Python: Typical project structure
# code/
# 01_import_data.py
# 02_clean_data.py
# 03_analysis.py
# utils/ # Helper functions
# __init__.py
# data_helpers.py
# config.py # Settings, paths
# In config.py:
from pathlib import Path
PROJECT_ROOT = Path(__file__).parent.parent
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
OUTPUT = PROJECT_ROOT / "output"
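Python has no built-in equivalent of Stata's master do-file, but a small runner script serves the same purpose. This is a sketch that assumes the numbered code/ layout shown above; the function name is illustrative:

```python
# Sketch: a Python "master script" that runs every numbered script in order
import subprocess
import sys
from pathlib import Path

def run_pipeline(code_dir):
    """Run all ##_*.py scripts in code_dir in numeric order; stop on the first failure."""
    scripts = sorted(Path(code_dir).glob("[0-9][0-9]_*.py"))
    for script in scripts:
        print(f"Running {script.name} ...")
        subprocess.run([sys.executable, str(script)], check=True)
    return [s.name for s in scripts]

# Usage from a hypothetical 00_main.py at the project root:
# run_pipeline("code")
```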
* Stata: Master do-file pattern
* Create a 00_master.do that runs everything
* ============================================
* 00_master.do - Run entire analysis pipeline
* ============================================
global PROJECT "/Users/researcher/my_project"
global CODE "$PROJECT/code"
global DATA "$PROJECT/data"
global OUTPUT "$PROJECT/output"
* Run scripts in order
do "$CODE/01_import_data.do"
do "$CODE/02_clean_data.do"
do "$CODE/03_merge_datasets.do"
do "$CODE/04_analysis.do"
do "$CODE/05_create_tables.do"
# R: Typical project structure
# Using here() package for portable paths
# code/
# 00_main.R # Master script
# 01_load_data.R
# 02_clean_data.R
# 03_analysis.R
# R/ # Reusable functions
# utils.R
# In 00_main.R:
library(here)
# Source all scripts in order
source(here("code", "01_load_data.R"))
source(here("code", "02_clean_data.R"))
source(here("code", "03_analysis.R"))
5.2 README Files
A README file is the front door to your project. It tells collaborators (including your future self) what the project does, how to run it, and what to expect. Every research project should have one.
This section introduces README basics. For comprehensive documentation practices, including data dictionaries, codebooks, and replication packages, see Module 8: Replicability.
Essential README Elements
At minimum, your README should answer these questions:
- What is this project? - Brief description of the research question
- How do I run it? - Step-by-step instructions to reproduce results
- What do I need? - Software requirements, data sources
- What does each file do? - Description of key scripts
README Template
# Project Title
Brief description of your research project and main findings.
## Data
- **Source:** Describe where data comes from
- **Access:** How to obtain the data (public/restricted)
- **Files:** List main data files
## Requirements
### Software
- Python 3.9+ with packages in `requirements.txt`
- Stata 17 (for some analyses)
- R 4.2+ with packages in `renv.lock`
### Installation
```bash
pip install -r requirements.txt
```
## How to Replicate
1. Clone this repository
2. Download data from [source] and place in `data/raw/`
3. Run scripts in order:
- `01_import_data.py` - Load and validate raw data
- `02_clean_data.py` - Handle missing values, create variables
- `03_analysis.py` - Main regression analysis
- `04_figures.py` - Generate all figures
Or run everything with:
```bash
python 00_main.py
```
## File Structure
```
├── code/ # Analysis scripts
├── data/ # Data files (not tracked in git)
├── output/ # Tables and figures
└── README.md # This file
```
## Authors
Your Name (your.email@university.edu)
## License
MIT License (or appropriate license)
README for Different Audiences
| Audience | Key Information |
|---|---|
| Replicator | Exact steps to reproduce results, software versions, data access |
| Collaborator | Code organization, where to add new analyses, coding conventions |
| Future You | What you were thinking, why certain decisions were made |
| Reviewer/Editor | What data is included, what is sensitive, how results map to paper |
5.3 Descriptive Statistics
Before computing statistics, let's load a dataset to work with throughout this module. We'll use the NLSW88 dataset — a widely used, publicly available extract from the National Longitudinal Survey of Young Women (1988) with labor market data for 2,246 women.
import pandas as pd
import numpy as np
# Load the NLSW88 dataset (National Longitudinal Survey of Young Women, 1988)
df = pd.read_stata('https://www.stata-press.com/data/r18/nlsw88.dta')
# Rename columns to match this module's examples
df = df.rename(columns={'wage': 'income', 'grade': 'education',
'ttl_exp': 'experience'})
print(f"Loaded {len(df)} observations and {len(df.columns)} variables")
print(df.head())
* Load the NLSW88 dataset (National Longitudinal Survey of Young Women, 1988)
webuse nlsw88, clear
* Rename variables to match this module's examples
rename wage income
rename grade education
rename ttl_exp experience
describe, short
# Load the NLSW88 dataset (National Longitudinal Survey of Young Women, 1988)
library(tidyverse)
df <- haven::read_dta("https://www.stata-press.com/data/r18/nlsw88.dta")
# Rename columns to match this module's examples
df <- df %>% rename(income = wage, education = grade,
experience = ttl_exp)
cat("Loaded", nrow(df), "observations and", ncol(df), "variables\n")
head(df)
# Python: Descriptive statistics
import pandas as pd
df.describe() # Summary stats
df['income'].mean() # Mean
df['income'].median() # Median
df['income'].std() # Standard deviation
df['income'].quantile([0.25, 0.75]) # Percentiles
# By group
df.groupby('education')['income'].mean()
* Stata: Descriptive statistics
summarize income
summarize income, detail
tabstat income, by(education) stat(mean sd n)
# R: Descriptive statistics
summary(df$income)
mean(df$income, na.rm = TRUE)
sd(df$income, na.rm = TRUE)
# By group with dplyr
df %>% group_by(education) %>% summarize(mean_inc = mean(income))
5.4 Data Visualization
Histograms
# Python: Histogram
import matplotlib.pyplot as plt
plt.hist(df['income'], bins=30, edgecolor='white')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.title('Distribution of Income')
plt.savefig('histogram.png')
* Stata: Histogram
histogram income, frequency ///
title("Distribution of Income")
graph export "histogram.png", replace
# R: Histogram with ggplot2
ggplot(df, aes(x = income)) +
geom_histogram(bins = 30, fill = "steelblue") +
labs(title = "Distribution of Income")
ggsave("histogram.png")
Scatter Plots
# Python: Scatter plot
import seaborn as sns
sns.scatterplot(data=df, x='experience', y='income', hue='collgrad')
plt.xlabel('Total work experience (years)')
plt.ylabel('Hourly wage')
plt.savefig('scatter.png')
* Stata: Scatter plot
twoway (scatter income experience)
# R: Scatter plot
ggplot(df, aes(x = experience, y = income, color = factor(collgrad))) +
  geom_point()
5.5 Correlation Analysis
# Python: Correlation
df[['income', 'education', 'age']].corr()
# With p-values (drop missing values first — pearsonr does not accept NaN)
from scipy import stats
sub = df[['income', 'education']].dropna()
r, p = stats.pearsonr(sub['income'], sub['education'])
* Stata: Correlation
correlate income education age
pwcorr income education age, sig star(.05)
# R: Correlation
cor(df[c("income", "education", "age")], use = "complete.obs")
cor.test(df$income, df$education)
5.6 Basic Regression Analysis
Regression analysis is the foundation of quantitative empirical research. This section covers how to run regressions and—critically—how to interpret what the output means. Understanding these basics is essential before moving to causal inference methods in Module 6.
Running a Basic Regression
# Python: OLS Regression with statsmodels
import statsmodels.formula.api as smf
# Simple regression: income on education
model = smf.ols('income ~ education', data=df).fit()
# Multiple regression: add controls
model = smf.ols('income ~ education + age + experience', data=df).fit()
# View results
print(model.summary())
* Stata: OLS Regression
* Simple regression
reg income education
* Multiple regression with controls
reg income education age experience
* Robust standard errors (heteroskedasticity-robust)
reg income education age experience, robust
* Clustered standard errors (here by industry; cluster on your sampling unit)
reg income education age experience, cluster(industry)
# R: OLS Regression
# Simple regression
model <- lm(income ~ education, data = df)
# Multiple regression with controls
model <- lm(income ~ education + age + experience, data = df)
# View results
summary(model)
# Robust standard errors (using sandwich package)
library(sandwich)
library(lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
Understanding Regression Output
A regression output contains many statistics. Here's what each one means and how to interpret it:
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      -0.523      1.423     -0.368      0.713      -3.314       2.267
education       0.624      0.047     13.283      0.000       0.532       0.717
age            -0.009      0.037     -0.230      0.818      -0.081       0.064
==============================================================================
R-squared: 0.146      F-statistic: 191.3      Prob (F-statistic): 2.45e-77
==============================================================================
Key Components Explained
| Component | What It Tells You | Example Interpretation |
|---|---|---|
| Coefficient (coef) | The estimated effect of a one-unit change in X on Y | One more year of education is associated with $0.62 higher hourly wage |
| Standard Error (std err) | Uncertainty in the coefficient estimate | Our estimate of education's effect could vary by about $0.05 |
| t-statistic (t) | Coefficient divided by standard error; tests if coef differs from 0 | 13.28 is very large, indicating strong evidence of an effect |
| P-value (P>|t|) | Probability of seeing this result if true effect were zero | 0.000 means extremely unlikely to see this by chance |
| Confidence Interval | Range constructed so that, across repeated samples, 95% of such intervals contain the true coefficient | The data are consistent with an education effect between $0.53 and $0.72 per hour |
| R-squared | Proportion of variance in Y explained by the model | Education and age explain 14.6% of wage variation |
| F-statistic | Tests whether all coefficients are jointly zero | A high F (191.3) with a tiny p-value means the regressors are jointly significant |
A regression coefficient measures association, not causation. The education coefficient tells us that higher education is associated with higher income—not that education causes higher income. People who get more education may differ in other ways (ability, motivation, family background) that also affect income. To make causal claims, you need the methods covered in Module 6: Causal Inference.
Statistical Significance
A coefficient is "statistically significant" when we can be confident it differs from zero. Common thresholds:
- p < 0.01 (***): Very strong evidence
- p < 0.05 (**): Strong evidence (most common threshold)
- p < 0.10 (*): Weak evidence
However, statistical significance tells you nothing about practical significance. A coefficient can be statistically significant but economically tiny—or statistically insignificant but potentially meaningful with more data.
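To make these quantities concrete, here is the 95% confidence interval and t-statistic computed by hand from the education row of the output above. This is a sketch using the normal approximation (critical value 1.96), which is close to the exact t critical value in large samples:

```python
# Sketch: t-statistic and 95% CI from a coefficient and its standard error
coef, se = 0.624, 0.047          # education row from the regression output above

t_stat = coef / se               # matches the printed t up to rounding
lo, hi = coef - 1.96 * se, coef + 1.96 * se

print(f"t = {t_stat:.2f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```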
Extracting OLS Components: SSR, ESS, TSS, R²
Understanding the decomposition of variance in regression is fundamental. The Total Sum of Squares (TSS) measures total variation in the outcome. OLS splits this into the Explained Sum of Squares (ESS)—variation captured by the model—and the Sum of Squared Residuals (SSR)—unexplained variation. The identity TSS = ESS + SSR always holds for OLS with an intercept.
# Python: Extracting SSR, ESS, TSS, R-squared
import statsmodels.formula.api as smf
import numpy as np
model = smf.ols('income ~ education + experience', data=df).fit()
# Direct access from model object
print(f"SSR (Residual SS): {model.ssr:.4f}")
print(f"ESS (Model SS): {model.ess:.4f}")
print(f"TSS (Total SS): {model.centered_tss:.4f}")
print(f"R-squared: {model.rsquared:.4f}")
print(f"Adj. R-squared: {model.rsquared_adj:.4f}")
# Manual calculation (for understanding)
y = df['income']
y_hat = model.fittedvalues
e = model.resid
SSR = np.sum(e**2)
TSS = np.sum((y - y.mean())**2)
ESS = np.sum((y_hat - y.mean())**2)
R2 = 1 - SSR / TSS
# Verify the identity: TSS = ESS + SSR
print(f"TSS = {TSS:.4f}, ESS + SSR = {ESS + SSR:.4f}")
# Adjusted R-squared
n = len(y)
k = model.df_model # number of regressors
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
* Stata: Extracting SSR, ESS, TSS, R-squared
reg income education experience
* All stored results accessible via ereturn list
ereturn list
* Key scalars:
display "SSR (Residual SS): " e(rss)
display "ESS (Model SS): " e(mss)
display "R-squared: " e(r2)
display "Adj R-squared: " e(r2_a)
display "N observations: " e(N)
display "Degrees of freedom: " e(df_r)
* TSS = ESS + SSR
display "TSS = " e(mss) + e(rss)
* Manual calculation from residuals
predict residuals, residuals
predict yhat, xb
quietly summarize income
gen dev_sq = (income - r(mean))^2
gen res_sq = residuals^2
gen fit_sq = (yhat - r(mean))^2
quietly summarize dev_sq
display "TSS = " r(sum)
quietly summarize res_sq
display "SSR = " r(sum)
# R: Extracting SSR, ESS, TSS, R-squared
model <- lm(income ~ education + experience, data = df)
# Extract key objects
e_hat <- residuals(model) # residual vector
y_hat <- fitted(model) # predicted values
y <- df$income # actual outcome
# Manual calculations
SSR <- sum(e_hat^2) # Sum of Squared Residuals
TSS <- sum((y - mean(y))^2) # Total Sum of Squares
ESS <- sum((y_hat - mean(y))^2) # Explained Sum of Squares
# Verify: TSS = ESS + SSR
cat("TSS =", TSS, " ESS + SSR =", ESS + SSR, "\n")
# R-squared and adjusted R-squared
R2 <- 1 - SSR / TSS
n <- nrow(df)
k <- length(coef(model)) - 1 # number of regressors (excl. intercept)
adj_R2 <- 1 - (1 - R2) * (n - 1) / (n - k - 1)
# Compare with built-in
cat("Manual R2:", R2, " summary R2:", summary(model)$r.squared, "\n")
cat("Manual adj-R2:", adj_R2, " summary adj-R2:", summary(model)$adj.r.squared, "\n")
# Alternative: deviance() gives SSR directly
deviance(model) # same as SSR
TSS = ESS + SSR (with an intercept). R² = ESS/TSS = 1 - SSR/TSS tells you the fraction of total variance explained by the model. Adjusted R² penalizes for adding regressors: it can decrease if a new variable adds more noise than explanatory power.
In practice, you rarely compute these manually—but understanding the decomposition helps you interpret regression output, compare models, and verify that your estimation code is correct.
F-Tests and Joint Significance
An F-test evaluates whether a set of coefficients are jointly significantly different from zero (or from each other). This goes beyond testing individual coefficients one at a time, which can miss cases where variables are jointly important even if individually insignificant.
# Python: F-tests and joint significance
import statsmodels.formula.api as smf
import numpy as np
model = smf.ols('income ~ education + experience + age', data=df).fit()
# Joint test: education AND experience = 0
f_result = model.f_test('education = 0, experience = 0')
print(f_result)
# Test a linear restriction: education = experience
f_restrict = model.f_test('education = experience')
print(f_restrict)
# Overall F-test (all regressors jointly = 0)
print(f"Overall F-statistic: {model.fvalue:.4f}")
print(f"Overall F p-value: {model.f_pvalue:.6f}")
* Stata: F-tests and joint significance
reg income education experience age
* Joint test: education AND experience = 0
test education experience
* Test a linear restriction: education = experience
test education = experience
* Multiple restrictions at once
test (education = 0) (experience = 0) (age = 0)
* Wald test (equivalent for linear models)
testparm education experience
# R: F-tests and joint significance
library(car)
model <- lm(income ~ education + experience + age, data = df)
# Joint test: education AND experience = 0
linearHypothesis(model, c("education = 0", "experience = 0"))
# Test a linear restriction: education = experience
linearHypothesis(model, "education = experience")
# With robust standard errors
library(sandwich)
linearHypothesis(model, c("education = 0", "experience = 0"),
vcov = vcovHC(model, type = "HC1"))
# Overall F-test (from summary)
f_stat <- summary(model)$fstatistic
pf(f_stat[1], f_stat[2], f_stat[3], lower.tail = FALSE)
Use F-tests when you want to test whether a group of variables matters jointly, test restrictions like "does education have the same effect as experience?", or compare nested models. With heteroskedasticity or clustering, always use the robust version (pass a vcov matrix in R, or use , robust in Stata).
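The F-statistic itself comes from comparing restricted and unrestricted residual sums of squares: F = [(SSR_r - SSR_u)/q] / [SSR_u/(n - k - 1)], where q is the number of restrictions. A small simulation (a sketch on made-up data, using only numpy) applies the formula directly:

```python
# Sketch: joint F-test by hand on simulated data (numpy only)
import numpy as np

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)       # x1 matters, x2 does not

def ssr(X):
    """Sum of squared residuals from OLS of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

X_u = np.column_stack([np.ones(n), x1, x2])   # unrestricted model
X_r = np.ones((n, 1))                         # restricted: coefs on x1, x2 both zero

q, k = 2, 2                                   # restrictions tested; regressors in full model
F = ((ssr(X_r) - ssr(X_u)) / q) / (ssr(X_u) / (n - k - 1))
print(f"F = {F:.2f}")  # large, because x1 truly belongs in the model
```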
Common Regression Specifications
# Python: Common specifications
import statsmodels.formula.api as smf
import numpy as np
# Log transformation
df['log_income'] = np.log(df['income'])
model = smf.ols('log_income ~ education + age', data=df).fit()
# Quadratic term (non-linear relationship)
model = smf.ols('income ~ age + I(age**2)', data=df).fit()
# Interaction terms (union = union membership indicator in NLSW88)
model = smf.ols('income ~ education * union', data=df).fit()
# Categorical variables (dummy encoding)
model = smf.ols('income ~ education + C(industry)', data=df).fit()
* Stata: Common specifications
* Log transformation
gen log_income = ln(income)
reg log_income education age
* Quadratic term
reg income c.age##c.age
* Interaction terms
reg income c.education##i.union
* Categorical variables
reg income education i.industry
* Fixed effects
areg income education age, absorb(industry)
# R: Common specifications
# Log transformation
model <- lm(log(income) ~ education + age, data = df)
# Quadratic term
model <- lm(income ~ age + I(age^2), data = df)
# Interaction terms
model <- lm(income ~ education * union, data = df)
# Categorical variables
model <- lm(income ~ education + factor(industry), data = df)
# Fixed effects (using fixest package)
library(fixest)
model <- feols(income ~ education + age | industry, data = df)
Interpreting Log Transformations
| Model | Form | Interpretation of b1 |
|---|---|---|
| Linear | Y = b0 + b1*X | 1 unit increase in X → b1 unit change in Y |
| Log-Linear | log(Y) = b0 + b1*X | 1 unit increase in X → approximately (b1 × 100)% change in Y |
| Linear-Log | Y = b0 + b1*log(X) | 1% increase in X → approximately b1/100 unit change in Y |
| Log-Log | log(Y) = b0 + b1*log(X) | 1% increase in X → approximately b1% change in Y (elasticity) |
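The "(b1 × 100)%" reading of a log-linear coefficient is an approximation that works well when b1 is small. The exact percent change in Y is 100 × (exp(b1) − 1), and the gap grows with the coefficient — a quick sketch:

```python
# Sketch: approximate vs exact percent change in a log-linear model
import numpy as np

for b1 in [0.02, 0.10, 0.50]:
    approx = 100 * b1                  # the "b1 x 100 percent" rule of thumb
    exact = 100 * (np.exp(b1) - 1)     # exact percent change in Y
    print(f"b1 = {b1:.2f}: approx {approx:.1f}%, exact {exact:.1f}%")
```

For b1 = 0.02 the two readings are nearly identical, but for b1 = 0.50 the exact change is about 65%, not 50%.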
Choosing Control Variables: Avoiding Post-Treatment Bias
A common regression mistake is controlling for variables that are themselves affected by the treatment. These are called post-treatment variables (or "bad controls"). Including them blocks part of the causal path and biases the estimate of the treatment effect.
Suppose you study the effect of a job training program (assigned in 2020) on wages in 2021. You should control for pre-treatment variables like age and education (measured before 2020). But you should not control for job title in 2021 — if training changes someone's job title, which then changes their wage, controlling for job title blocks the indirect effect and underestimates the true treatment impact.
Rule of thumb: Only control for variables that were determined before treatment was assigned. If a variable could have been affected by the treatment, leave it out. This applies to all regression-based methods: OLS, matching, DiD, and IV.
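A short simulation makes the bias visible (a sketch on simulated data, numpy only). The treatment raises a mediator by 2, and the mediator raises the outcome by 3, so the total treatment effect is 2 × 3 = 6. Regressing on treatment alone recovers it; adding the post-treatment mediator drives the treatment coefficient toward zero:

```python
# Sketch: post-treatment ("bad control") bias on simulated data
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
T = rng.binomial(1, 0.5, n).astype(float)   # randomly assigned treatment
M = 2.0 * T + rng.normal(size=n)            # mediator, affected by treatment
Y = 3.0 * M + rng.normal(size=n)            # outcome; total effect of T is 2 * 3 = 6

def ols(X):
    """OLS coefficients of Y on the columns of X."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

b_good = ols(np.column_stack([np.ones(n), T]))     # correct: T coefficient near 6
b_bad = ols(np.column_stack([np.ones(n), T, M]))   # bad control: T coefficient near 0

print(f"Without mediator: {b_good[1]:.2f}   With mediator: {b_bad[1]:.2f}")
```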
5.7 Summary Tables for Papers
# Python: Summary table with stargazer
# pip install stargazer
from stargazer.stargazer import Stargazer
# For regression tables
import statsmodels.formula.api as smf
model = smf.ols('income ~ education + age', data=df).fit()
stargazer = Stargazer([model])
print(stargazer.render_latex())
* Stata: Summary tables with estout
eststo clear
eststo: reg income education
eststo: reg income education age
esttab using "table.tex", replace se star(* 0.1 ** 0.05 *** 0.01)
# R: Summary tables with modelsummary
library(modelsummary)
model1 <- lm(income ~ education, data = df)
model2 <- lm(income ~ education + age, data = df)
modelsummary(list(model1, model2), stars = TRUE, output = "table.tex")
5.8 Exercises
Exercise 5.1: Descriptive Statistics and Visualization
Practice computing statistics and creating visualizations. Complete these tasks:
- Calculate descriptive statistics (mean, median, std)
- Create a histogram
- Create a scatter plot