5  Data Analysis & Visualization

Estimated time: ~8 hours · Topics: EDA, Statistics, Plots · Level: Intermediate

Learning Objectives

  • Organize research projects with clear script naming conventions
  • Create and maintain README files for reproducibility
  • Compute and interpret descriptive statistics
  • Create publication-quality visualizations
  • Run and interpret regression analyses
  • Generate summary tables for papers

5.1 Project Organization

Before diving into analysis, establishing a clear project structure is essential. A well-organized project allows you (and others) to understand what each script does and the order in which to run them. This becomes critical as projects grow in complexity and when sharing code for replication.

Script Naming Conventions

Script names should reveal what they do and the order to run them. The most effective approach uses numbered prefixes combined with descriptive names:

Naming Pattern

##_action_description.ext
where ## is a two-digit number (01, 02, 03...) indicating execution order.

Example Project Structure

project/
├── data/
│   ├── raw/                    # Original, untouched data
│   │   ├── census_2020.csv
│   │   └── survey_responses.xlsx
│   ├── processed/              # Cleaned, analysis-ready data
│   │   └── analysis_sample.csv
│   └── temp/                   # Intermediate files
│
├── code/
│   ├── 01_import_data.py          # Load raw data
│   ├── 02_clean_data.py          # Handle missing values, outliers
│   ├── 03_merge_datasets.py      # Combine data sources
│   ├── 04_create_variables.py    # Generate analysis variables
│   ├── 05_descriptive_stats.py   # Summary statistics, EDA
│   ├── 06_regression_analysis.py # Main estimation
│   ├── 07_robustness_checks.py   # Sensitivity analyses
│   └── 08_create_figures.py     # Publication plots
│
├── output/
│   ├── tables/
│   └── figures/
│
└── README.md                    # Project documentation

Naming Best Practices

| Practice | Good | Avoid |
|---|---|---|
| Use numbers for order | 01_import.py | import.py |
| Be descriptive | 03_clean_income_vars.do | 03_clean.do |
| Use underscores | 05_merge_datasets.R | 05 merge datasets.R |
| Lowercase | 02_data_prep.py | 02_Data_Prep.py |
| Skip numbers for flexibility | 01, 02, 05, 10 | 1, 2, 3, 4 |

Why Skip Numbers?

Leave gaps in your numbering (01, 02, 05, 10...) so you can insert new scripts without renaming everything. If you later need a script between 02_clean.py and 05_merge.py, you can create 03_validate.py without disrupting the sequence.
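
Numbered prefixes also make the pipeline easy to automate. Below is a minimal driver sketch (a hypothetical run_pipeline.py; it assumes each numbered script lives in code/ and can run standalone):

# Python: Run numbered scripts in order (a sketch, not part of the
# example project above)
import subprocess
import sys
from pathlib import Path

code_dir = Path("code")

# Zero-padded prefixes mean lexicographic order equals execution order
for script in sorted(code_dir.glob("[0-9][0-9]_*.py")):
    print(f"Running {script.name} ...")
    subprocess.run([sys.executable, str(script)], check=True)  # stop on first failure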

Language-Specific Conventions

# Python: Typical project structure
# code/
#   01_import_data.py
#   02_clean_data.py
#   03_analysis.py
#   utils/               # Helper functions
#     __init__.py
#     data_helpers.py
#   config.py            # Settings, paths

# In config.py:
from pathlib import Path

PROJECT_ROOT = Path(__file__).parent.parent
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
OUTPUT = PROJECT_ROOT / "output"
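# Any other script can then import these shared paths instead of
# hard-coding them. A sketch (assumes config.py is importable from the
# analysis scripts; the file name below is illustrative):
#
#   # In 01_import_data.py:
#   import pandas as pd
#   from config import DATA_RAW
#
#   df = pd.read_csv(DATA_RAW / "census_2020.csv")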
* Stata: Master do-file pattern
* Create a 00_master.do that runs everything

* ============================================
* 00_master.do - Run entire analysis pipeline
* ============================================

global PROJECT "/Users/researcher/my_project"
global CODE    "$PROJECT/code"
global DATA    "$PROJECT/data"
global OUTPUT  "$PROJECT/output"

* Run scripts in order
do "$CODE/01_import_data.do"
do "$CODE/02_clean_data.do"
do "$CODE/03_merge_datasets.do"
do "$CODE/04_analysis.do"
do "$CODE/05_create_tables.do"
# R: Typical project structure
# Using here() package for portable paths

# code/
#   00_main.R            # Master script
#   01_load_data.R
#   02_clean_data.R
#   03_analysis.R
#   R/                   # Reusable functions
#     utils.R

# In 00_main.R:
library(here)

# Source all scripts in order
source(here("code", "01_load_data.R"))
source(here("code", "02_clean_data.R"))
source(here("code", "03_analysis.R"))
Python Output
Project paths configured successfully:
  PROJECT_ROOT:   /Users/researcher/my_project
  DATA_RAW:       /Users/researcher/my_project/data/raw
  DATA_PROCESSED: /Users/researcher/my_project/data/processed
  OUTPUT:         /Users/researcher/my_project/output
Stata Output
. global PROJECT "/Users/researcher/my_project"
. global CODE    "$PROJECT/code"
. global DATA    "$PROJECT/data"
. global OUTPUT  "$PROJECT/output"

. do "$CODE/01_import_data.do"
running 01_import_data.do ...

. do "$CODE/02_clean_data.do"
running 02_clean_data.do ...

. do "$CODE/03_merge_datasets.do"
running 03_merge_datasets.do ...

. do "$CODE/04_analysis.do"
running 04_analysis.do ...

. do "$CODE/05_create_tables.do"
running 05_create_tables.do ...

All scripts completed successfully.
R Output
> library(here)
here() starts at /Users/researcher/my_project
> source(here("code", "01_load_data.R"))
Sourcing 01_load_data.R...
> source(here("code", "02_clean_data.R"))
Sourcing 02_clean_data.R...
> source(here("code", "03_analysis.R"))
Sourcing 03_analysis.R...
All scripts executed successfully.

5.2 README Files

A README file is the front door to your project. It tells collaborators (including your future self) what the project does, how to run it, and what to expect. Every research project should have one.

Deep Dive in Module 8

This section introduces README basics. For comprehensive documentation practices, including data dictionaries, codebooks, and replication packages, see Module 8: Replicability.

Essential README Elements

At minimum, your README should answer these questions:

  1. What is this project? - Brief description of the research question
  2. How do I run it? - Step-by-step instructions to reproduce results
  3. What do I need? - Software requirements, data sources
  4. What does each file do? - Description of key scripts

README Template

# Project Title

Brief description of your research project and main findings.

## Data

- **Source:** Describe where data comes from
- **Access:** How to obtain the data (public/restricted)
- **Files:** List main data files

## Requirements

### Software
- Python 3.9+ with packages in `requirements.txt`
- Stata 17 (for some analyses)
- R 4.2+ with packages in `renv.lock`

### Installation
```bash
pip install -r requirements.txt
```

## How to Replicate

1. Clone this repository
2. Download data from [source] and place in `data/raw/`
3. Run scripts in order:
   - `01_import_data.py` - Load and validate raw data
   - `02_clean_data.py` - Handle missing values, create variables
   - `03_analysis.py` - Main regression analysis
   - `04_figures.py` - Generate all figures

Or run everything with:
```bash
python 00_main.py
```

## File Structure

```
├── code/          # Analysis scripts
├── data/          # Data files (not tracked in git)
├── output/        # Tables and figures
└── README.md      # This file
```

## Authors

Your Name (your.email@university.edu)

## License

MIT License (or appropriate license)

README for Different Audiences

| Audience | Key Information |
|---|---|
| Replicator | Exact steps to reproduce results, software versions, data access |
| Collaborator | Code organization, where to add new analyses, coding conventions |
| Future You | What you were thinking, why certain decisions were made |
| Reviewer/Editor | What data is included, what is sensitive, how results map to paper |

5.3 Descriptive Statistics

# Python: Descriptive statistics
import pandas as pd

df.describe()                          # Summary stats
df['income'].mean()                   # Mean
df['income'].median()                 # Median
df['income'].std()                    # Standard deviation
df['income'].quantile([0.25, 0.75])  # Percentiles

# By group
df.groupby('education')['income'].mean()
* Stata: Descriptive statistics
summarize income
summarize income, detail
tabstat income, by(education) stat(mean sd n)
# R: Descriptive statistics
summary(df$income)
mean(df$income, na.rm = TRUE)
sd(df$income, na.rm = TRUE)

# By group with dplyr
df %>% group_by(education) %>% summarize(mean_inc = mean(income))
Python Output
>>> df.describe()
              income    education          age
count    1000.000000  1000.000000  1000.000000
mean    52847.632000    14.230000    42.150000
std     18234.510000     2.890000    12.340000
min     18500.000000     8.000000    22.000000
25%     38250.000000    12.000000    32.000000
50%     51200.000000    14.000000    41.000000
75%     65800.000000    16.000000    52.000000
max    125000.000000    21.000000    65.000000

>>> df['income'].mean()
52847.632
>>> df['income'].median()
51200.0
>>> df['income'].std()
18234.51
>>> df['income'].quantile([0.25, 0.75])
0.25    38250.0
0.75    65800.0
Name: income, dtype: float64

>>> df.groupby('education')['income'].mean()
education
High School    35420.50
Bachelor's     52180.25
Master's       68450.80
PhD            89230.15
Name: income, dtype: float64
Stata Output
. summarize income

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      income |      1,000    52847.63    18234.51      18500     125000

. summarize income, detail

                            income
-------------------------------------------------------------
      Percentiles      Smallest
 1%        22150          18500
 5%        28420          19200
10%        31850          20100       Obs               1,000
25%        38250          21500       Sum of wgt.       1,000

50%        51200                      Mean           52847.63
                        Largest       Std. dev.      18234.51
75%        65800         112000
90%        78500         118500       Variance       3.32e+08
95%        89200         122000       Skewness       .4521832
99%       105000         125000       Kurtosis       2.847123

. tabstat income, by(education) stat(mean sd n)

Summary for variables: income
Group variable: education

   education |      Mean        SD         N
-------------+-------------------------------
 High School |  35420.50   8234.12       250
  Bachelor's |  52180.25  12456.80       400
    Master's |  68450.80  15678.30       250
         PhD |  89230.15  18920.45       100
-------------+-------------------------------
       Total |  52847.63  18234.51      1000
----------------------------------------------
R Output
> summary(df$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18500   38250   51200   52848   65800  125000
> mean(df$income, na.rm = TRUE)
[1] 52847.63
> sd(df$income, na.rm = TRUE)
[1] 18234.51
> df %>% group_by(education) %>% summarize(mean_inc = mean(income))
# A tibble: 4 x 2
  education   mean_inc
  <chr>          <dbl>
1 High School   35421.
2 Bachelor's    52180.
3 Master's      68451.
4 PhD           89230.
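
For papers you often need several statistics per group at once. A minimal Python sketch (the column names follow the examples above; adjust them to your data):

# Python: Multiple statistics per group in one table
group_stats = (
    df.groupby('education')['income']
      .agg(['mean', 'median', 'std', 'count'])
      .round(1)
)
print(group_stats)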

5.4 Data Visualization

Histograms

# Python: Histogram
import matplotlib.pyplot as plt

plt.hist(df['income'], bins=30, edgecolor='white')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.title('Distribution of Income')
plt.savefig('histogram.png')
* Stata: Histogram
histogram income, frequency ///
    title("Distribution of Income")
graph export "histogram.png", replace
# R: Histogram with ggplot2
ggplot(df, aes(x = income)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  labs(title = "Distribution of Income")
ggsave("histogram.png")
Python Output
[Histogram plot displayed]

          Distribution of Income
 80 |            ██
    |           ████
 60 |          ██████
    |         ████████
 40 |        ██████████
    |       ████████████
 20 |    ██████████████████
    | ████████████████████████
  0 +---------------------------
      20k   40k   60k   80k  100k
                Income

Figure saved to 'histogram.png'
Stata Output
. histogram income, frequency title("Distribution of Income")
(bin=30, start=18500, width=3550)

          Distribution of Income
 80 |            ██
    |           ████
 60 |          ██████
    |         ████████
 40 |        ██████████
    |       ████████████
 20 |    ██████████████████
    | ████████████████████████
  0 +---------------------------
      20k   40k   60k   80k  100k
                Income

. graph export "histogram.png", replace
(file histogram.png written in PNG format)
R Output
> ggplot(df, aes(x = income)) +
+   geom_histogram(bins = 30, fill = "steelblue") +
+   labs(title = "Distribution of Income")

          Distribution of Income
 80 |            ██
    |           ████
 60 |          ██████
    |         ████████
 40 |        ██████████
    |       ████████████
 20 |    ██████████████████
    | ████████████████████████
  0 +---------------------------
      20k   40k   60k   80k  100k
                Income

> ggsave("histogram.png")
Saving 7 x 5 in image
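
For publication figures you will usually want explicit control over size, labels, and resolution rather than the defaults. A Python sketch (the specific settings are illustrative, not required values):

# Python: Publication-quality histogram settings
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))             # width x height in inches
ax.hist(df['income'], bins=30, edgecolor='white')
ax.set_xlabel('Income (USD)')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Income')
fig.tight_layout()                                 # avoid clipped axis labels
fig.savefig('histogram.png', dpi=300)              # 300 dpi for print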

Scatter Plots

# Python: Scatter plot
import seaborn as sns

sns.scatterplot(data=df, x='gdpPercap', y='lifeExp', hue='continent')
plt.xscale('log')
plt.savefig('scatter.png')
* Stata: Scatter plot
twoway (scatter lifeExp gdpPercap), xscale(log)
# R: Scatter plot
ggplot(df, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()
Python Output
[Scatter plot displayed]

  Life Expectancy vs GDP per Capita (log scale)
 80 |                o o   * *
    |              o o   ***
 70 |         + ooo  *  *
    |       x ++   o
 60 |    xx   +
    |   xxx
 50 | xx
    +------------------------
      1k    5k   10k   50k  100k
          GDP per Capita (log)

Legend: x=Africa  +=Americas  o=Asia  *=Europe

Figure saved to 'scatter.png'
Stata Output
. twoway (scatter lifeExp gdpPercap), xscale(log)

    Life Expectancy vs GDP per Capita
 80 |                o o
    |              o o  o
 70 |         o  o o  o
    |       o  o    o
 60 |    o   o  o
    |   o  o  o
 50 | o o
    +------------------------
      1k    5k   10k   50k  100k
          GDP per Capita (log)

(graph displayed in Graph window)
R Output
> ggplot(df, aes(x = gdpPercap, y = lifeExp, color = continent)) +
+   geom_point() +
+   scale_x_log10()

  Life Expectancy vs GDP per Capita (log scale)
 80 |                o o   * *
    |              o o   ***
 70 |         + ooo  *  *
    |       x ++   o
 60 |    xx   +
    |   xxx
 50 | xx
    +------------------------
      1k    5k   10k   50k  100k
          GDP per Capita (log)

Legend: x=Africa  +=Americas  o=Asia  *=Europe
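
To show the relationship rather than just the point cloud, you can overlay a fitted line. A Python sketch using seaborn's regplot, which fits and draws a simple OLS line with a confidence band (logx=True fits y on log(x), matching the log-scaled axis):

# Python: Scatter plot with a fitted regression line
import seaborn as sns
import matplotlib.pyplot as plt

sns.regplot(data=df, x='gdpPercap', y='lifeExp', logx=True,
            scatter_kws={'alpha': 0.4})  # lighter points reveal overlap
plt.xscale('log')
plt.savefig('scatter_fit.png')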

5.5 Correlation Analysis

# Python: Correlation
df[['income', 'education', 'age']].corr()

# With p-values
from scipy import stats
r, p = stats.pearsonr(df['income'], df['education'])
* Stata: Correlation
correlate income education age
pwcorr income education age, sig star(.05)
# R: Correlation
cor(df[c("income", "education", "age")], use = "complete.obs")
cor.test(df$income, df$education)
Python Output
>>> df[['income', 'education', 'age']].corr()
             income  education       age
income     1.000000   0.652341  0.234567
education  0.652341   1.000000  0.089123
age        0.234567   0.089123  1.000000

>>> from scipy import stats
>>> r, p = stats.pearsonr(df['income'], df['education'])
>>> print(f"Correlation: r = {r:.4f}, p-value = {p:.4e}")
Correlation: r = 0.6523, p-value = 2.34e-112

The correlation between income and education is statistically significant (p < 0.001).
Stata Output
. correlate income education age
(obs=1,000)

             |   income education      age
-------------+-----------------------------
      income |   1.0000
   education |   0.6523    1.0000
         age |   0.2346    0.0891   1.0000

. pwcorr income education age, sig star(.05)

             |   income education      age
-------------+-----------------------------
      income |   1.0000
             |
   education |   0.6523*   1.0000
             |   0.0000
             |
         age |   0.2346*   0.0891*  1.0000
             |   0.0000    0.0047
             |

* indicates significance at the 5% level
R Output
> cor(df[c("income", "education", "age")], use = "complete.obs")
             income education       age
income    1.0000000 0.6523413 0.2345671
education 0.6523413 1.0000000 0.0891234
age       0.2345671 0.0891234 1.0000000

> cor.test(df$income, df$education)

        Pearson's product-moment correlation

data:  df$income and df$education
t = 27.234, df = 998, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6098 0.6921
sample estimates:
      cor
0.6523413

Strong positive correlation, highly significant (p < 0.001).
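
With more than a handful of variables, a heatmap is easier to read than a printed matrix. A minimal Python sketch:

# Python: Correlation matrix as a heatmap
import seaborn as sns
import matplotlib.pyplot as plt

corr = df[['income', 'education', 'age']].corr()
sns.heatmap(corr, annot=True, fmt='.2f',
            vmin=-1, vmax=1, cmap='vlag')  # fixed scale keeps colors comparable
plt.tight_layout()
plt.savefig('correlation_heatmap.png')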

5.6 Basic Regression Analysis

Regression analysis is the foundation of quantitative empirical research. This section covers how to run regressions and—critically—how to interpret what the output means. Understanding these basics is essential before moving to causal inference methods in Module 6.

Running a Basic Regression

# Python: OLS Regression with statsmodels
import statsmodels.formula.api as smf

# Simple regression: income on education
model = smf.ols('income ~ education', data=df).fit()

# Multiple regression: add controls
model = smf.ols('income ~ education + age + experience', data=df).fit()

# View results
print(model.summary())
* Stata: OLS Regression

* Simple regression
reg income education

* Multiple regression with controls
reg income education age experience

* Robust standard errors (heteroskedasticity-robust)
reg income education age experience, robust

* Clustered standard errors
reg income education age experience, cluster(state)
# R: OLS Regression

# Simple regression
model <- lm(income ~ education, data = df)

# Multiple regression with controls
model <- lm(income ~ education + age + experience, data = df)

# View results
summary(model)

# Robust standard errors (using sandwich package)
library(sandwich)
library(lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
Python Output
>>> print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 income   R-squared:                       0.478
Model:                            OLS   Adj. R-squared:                  0.476
Method:                 Least Squares   F-statistic:                     304.2
Date:                Mon, 27 Jan 2026   Prob (F-statistic):          1.23e-134
Time:                        14:32:15   Log-Likelihood:                -11245.
No. Observations:                1000   AIC:                         2.250e+04
Df Residuals:                     996   BIC:                         2.252e+04
Df Model:                           3
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   15234.50   1205.320     12.639      0.000    12869.45    17599.55
education    2847.63    142.580     19.972      0.000     2567.83     3127.43
age           156.24     28.450      5.491      0.000      100.39      212.09
experience    312.45     45.230      6.908      0.000      223.66      401.24
==============================================================================
Omnibus:        12.345   Durbin-Watson:        2.012
Prob(Omnibus):   0.002   Jarque-Bera (JB):    14.567
Skew:            0.234   Prob(JB):          0.000689
Kurtosis:        3.456   Cond. No.              156.
==============================================================================
Stata Output
. reg income education age experience

      Source |       SS           df       MS      Number of obs   =     1,000
-------------+----------------------------------   F(3, 996)       =    304.23
       Model |  1.5892e+11         3  5.2973e+10   Prob > F        =    0.0000
    Residual |  1.7342e+11       996   174117430   R-squared       =    0.4782
-------------+----------------------------------   Adj R-squared   =    0.4766
       Total |  3.3234e+11       999   332671671   Root MSE        =     13195

------------------------------------------------------------------------------
      income |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   2847.632   142.5801    19.97   0.000     2567.834    3127.429
         age |    156.243   28.44956     5.49   0.000     100.3936     212.092
  experience |    312.451   45.23245     6.91   0.000     223.6567    401.2453
       _cons |  15234.502   1205.320    12.64   0.000    12869.454   17599.550
------------------------------------------------------------------------------

. reg income education age experience, robust
(Robust standard errors - coefficients identical, SEs may differ)

. reg income education age experience, cluster(state)
(Clustered by state - 50 clusters)
R Output
> summary(model)

Call:
lm(formula = income ~ education + age + experience, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-38542  -8934   -234   8756  52341

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 15234.50    1205.32  12.639  < 2e-16 ***
education    2847.63     142.58  19.972  < 2e-16 ***
age           156.24      28.45   5.491 4.89e-08 ***
experience    312.45      45.23   6.908 7.82e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13200 on 996 degrees of freedom
Multiple R-squared:  0.4782,    Adjusted R-squared:  0.4766
F-statistic: 304.2 on 3 and 996 DF,  p-value: < 2.2e-16

> coeftest(model, vcov = vcovHC(model, type = "HC1"))

t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)
(Intercept) 15234.502   1312.456  11.6073 < 2.2e-16 ***
education    2847.632    156.234  18.2273 < 2.2e-16 ***
age           156.243     31.234   5.0024  6.45e-07 ***
experience    312.451     48.567   6.4332  1.89e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Understanding Regression Output

A regression output contains many statistics. Here's what each one means and how to interpret it:

Example Output

==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    15234.50    1205.32     12.639      0.000    12869.45    17599.55
education     2847.63     142.58     19.972      0.000     2567.83     3127.43
age            156.24      28.45      5.491      0.000      100.39      212.09
==============================================================================
R-squared:            0.423
F-statistic:          285.7
Prob (F-statistic): 1.23e-89
==============================================================================

Key Components Explained

| Component | What It Tells You | Example Interpretation |
|---|---|---|
| Coefficient (coef) | The estimated effect of a one-unit change in X on Y | One more year of education is associated with $2,847.63 higher income |
| Standard Error (std err) | Uncertainty in the coefficient estimate | Our estimate of education's effect could vary by about $142.58 |
| t-statistic (t) | Coefficient divided by standard error; tests whether the coefficient differs from 0 | 19.97 is very large, indicating strong evidence of an effect |
| P-value (P>\|t\|) | Probability of an estimate at least this extreme if the true effect were zero | 0.000 means it is extremely unlikely to see this by chance |
| Confidence Interval | A range that, across repeated samples, contains the true coefficient 95% of the time | The education effect is plausibly between $2,568 and $3,127 |
| R-squared | Proportion of the variance in Y explained by the model | Education and age explain 42.3% of income variation |
| F-statistic | Tests whether all coefficients are jointly zero | A high F (285.7) with a low p-value means the model explains something |
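
Every quantity in this table is also available programmatically, which is handy for building custom tables or checks. A short Python sketch for a fitted statsmodels result (these attribute names are statsmodels' actual API):

# Python: Extract regression statistics from a fitted model
import statsmodels.formula.api as smf

model = smf.ols('income ~ education + age', data=df).fit()

print(model.params)      # coefficients
print(model.bse)         # standard errors
print(model.tvalues)     # t-statistics
print(model.pvalues)     # p-values
print(model.conf_int())  # 95% confidence intervals
print(model.rsquared)    # R-squared
print(model.fvalue)      # F-statistic
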
Correlation vs. Causation

A regression coefficient measures association, not causation. The education coefficient tells us that higher education is associated with higher income—not that education causes higher income. People who get more education may differ in other ways (ability, motivation, family background) that also affect income. To make causal claims, you need the methods covered in Module 6: Causal Inference.

Statistical Significance

A coefficient is called "statistically significant" when its p-value falls below a chosen threshold, meaning an estimate this large would be unlikely if the true coefficient were zero. Common thresholds:

  • p < 0.01 (***): Very strong evidence
  • p < 0.05 (**): Strong evidence (most common threshold)
  • p < 0.10 (*): Weak evidence

However, statistical significance tells you nothing about practical significance. A coefficient can be statistically significant but economically tiny—or statistically insignificant but potentially meaningful with more data.
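
A quick gut check for practical significance is to scale a coefficient by the outcome's mean. A Python sketch using the numbers from the example output above:

# Python: How large is the effect relative to the outcome's scale?
coef_education = 2847.63   # education coefficient from the example output
mean_income = 52847.63     # mean income in the sample

# Each year of education is associated with ~5.4% of mean income
print(f"{coef_education / mean_income:.1%} of mean income per year of education")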

Common Regression Specifications

# Python: Common specifications
import statsmodels.formula.api as smf
import numpy as np

# Log transformation
df['log_income'] = np.log(df['income'])
model = smf.ols('log_income ~ education + age', data=df).fit()

# Quadratic term (non-linear relationship)
model = smf.ols('income ~ age + I(age**2)', data=df).fit()

# Interaction terms
model = smf.ols('income ~ education * gender', data=df).fit()

# Categorical variables (dummy encoding)
model = smf.ols('income ~ education + C(region)', data=df).fit()
* Stata: Common specifications

* Log transformation
gen log_income = ln(income)
reg log_income education age

* Quadratic term
reg income c.age##c.age

* Interaction terms
reg income c.education##i.female

* Categorical variables
reg income education i.region

* Fixed effects
areg income education age, absorb(state)
# R: Common specifications

# Log transformation
model <- lm(log(income) ~ education + age, data = df)

# Quadratic term
model <- lm(income ~ age + I(age^2), data = df)

# Interaction terms
model <- lm(income ~ education * gender, data = df)

# Categorical variables
model <- lm(income ~ education + factor(region), data = df)

# Fixed effects (using fixest package)
library(fixest)
model <- feols(income ~ education + age | state, data = df)
Python Output
# Log transformation model results:
                      coef   std err        t   P>|t|
Intercept           9.2345     0.052   177.59   0.000
education           0.0543     0.003    18.10   0.000   ← 5.43% increase per year of education
age                 0.0028     0.001     2.80   0.005   ← 0.28% increase per year of age

# Quadratic term model:
                      coef   std err        t   P>|t|
Intercept        -12456.23   3245.12    -3.84   0.000
age                1823.45    156.78    11.63   0.000
I(age**2)           -15.67      1.89    -8.29   0.000   ← Negative: diminishing returns to age

# Interaction model:
                      coef   std err        t   P>|t|
Intercept         12345.67   1234.56    10.00   0.000
education          2567.89    167.89    15.29   0.000
gender[T.Male]     8234.56   1567.89     5.25   0.000
education:gender   -234.56     89.12    -2.63   0.009   ← Gender gap narrows with education

# Categorical region model:
                         coef   std err        t   P>|t|
Intercept            25678.90   1456.78    17.63   0.000
education             2789.34    145.67    19.15   0.000
C(region)[T.South]   -3456.78    890.12    -3.88   0.000
C(region)[T.West]     2345.67    912.34     2.57   0.010
C(region)[T.Midwest] -1234.56    878.90    -1.40   0.161
Stata Output
. gen log_income = ln(income)
. reg log_income education age

------------------------------------------------------------------------------
  log_income |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .0543012   .0030012    18.10   0.000     .0484112    .0601912
         age |   .0028456   .0010162     2.80   0.005     .0008512    .0048400
       _cons |   9.234512   .0520123   177.59   0.000     9.132456    9.336568
------------------------------------------------------------------------------
Interpretation: 1 year of education → 5.43% higher income

. reg income c.age##c.age

------------------------------------------------------------------------------
      income |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1823.456   156.7823    11.63   0.000     1515.789    2131.123
             |
 c.age#c.age |   -15.6789    1.89123    -8.29   0.000    -19.39012   -11.96768
             |
       _cons |  -12456.23   3245.123    -3.84   0.000    -18824.56   -6087.904
------------------------------------------------------------------------------
Negative age-squared: income peaks then declines

. reg income c.education##i.female

------------------------------------------------------------------------------
      income |      Coef.   Std. Err.      t    P>|t|
-------------+----------------------------------------------------------------
   education |   2567.890   167.8901    15.29   0.000
    1.female |  -8234.560   1567.890    -5.25   0.000
             |
 c.education#|
    1.female |    234.560    89.1234     2.63   0.009
             |
       _cons |  12345.670   1234.560    10.00   0.000
------------------------------------------------------------------------------
Positive interaction: education effect larger for women
R Output
> model <- lm(log(income) ~ education + age, data = df)
> summary(model)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.2345120  0.0520123 177.587  < 2e-16 ***
education   0.0543012  0.0030012  18.093  < 2e-16 ***   # 5.43% per year
age         0.0028456  0.0010162   2.800  0.00521 **    # 0.28% per year

> model <- lm(income ~ age + I(age^2), data = df)
> summary(model)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -12456.23    3245.12  -3.838 0.000131 ***
age           1823.46     156.78  11.630  < 2e-16 ***
I(age^2)       -15.68       1.89  -8.291  < 2e-16 ***   # Peak at age ~58

> model <- lm(income ~ education * gender, data = df)
> summary(model)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          12345.67    1234.56  10.000  < 2e-16 ***
education             2567.89     167.89  15.294  < 2e-16 ***
genderMale            8234.56    1567.89   5.251 1.89e-07 ***
education:genderMale  -234.56      89.12  -2.632  0.00862 **

> library(fixest)
> model <- feols(income ~ education + age | state, data = df)
> summary(model)

OLS estimation, Dep. Var.: income
Observations: 1,000
Fixed-effects: state: 50
Standard-errors: Clustered (state)
           Estimate Std. Error t value  Pr(>|t|)
education   2654.32     189.45  14.014 < 2.2e-16 ***
age          145.67      34.56   4.214 0.0001034 ***

Interpreting Log Transformations

| Model | Form | Interpretation of b1 |
|---|---|---|
| Linear | Y = b0 + b1*X | 1-unit increase in X → b1-unit change in Y |
| Log-Linear | log(Y) = b0 + b1*X | 1-unit increase in X → approximately (b1 × 100)% change in Y |
| Linear-Log | Y = b0 + b1*log(X) | 1% increase in X → approximately b1/100-unit change in Y |
| Log-Log | log(Y) = b0 + b1*log(X) | 1% increase in X → approximately b1% change in Y (an elasticity) |
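
The percentage readings in this table are approximations that are accurate for small coefficients. For a log-linear model, the exact effect of a one-unit change in X is 100 * (exp(b1) - 1)%, which you can verify against the education coefficient from the log-income example:

# Python: Exact vs. approximate percent effect in a log-linear model
import numpy as np

b1 = 0.0543                     # education coefficient from the log-income model
approx = b1 * 100               # rule of thumb: 5.43%
exact = (np.exp(b1) - 1) * 100  # exact: about 5.58%
print(f"approx {approx:.2f}% vs exact {exact:.2f}%")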

5.7 Summary Tables for Papers

# Python: Summary table with stargazer
# pip install stargazer
from stargazer.stargazer import Stargazer

# For regression tables
import statsmodels.formula.api as smf
model = smf.ols('income ~ education + age', data=df).fit()
stargazer = Stargazer([model])
print(stargazer.render_latex())
* Stata: Summary tables with estout
eststo clear
eststo: reg income education
eststo: reg income education age
esttab using "table.tex", replace se star(* 0.1 ** 0.05 *** 0.01)
# R: Summary tables with modelsummary
library(modelsummary)

model1 <- lm(income ~ education, data = df)
model2 <- lm(income ~ education + age, data = df)

modelsummary(list(model1, model2), stars = TRUE, output = "table.tex")
Python Output
>>> print(stargazer.render_latex())
\begin{table}[!htbp] \centering
  \caption{}
  \label{}
\begin{tabular}{@{\extracolsep{5pt}}lc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
 & \multicolumn{1}{c}{\textit{Dependent variable:}} \\
\cline{2-2}
\\[-1.8ex] & income \\
\hline \\[-1.8ex]
 education & 2847.632$^{***}$ \\
 & (142.580) \\
 age & 156.243$^{***}$ \\
 & (28.450) \\
 Constant & 15234.502$^{***}$ \\
 & (1205.320) \\
\hline \\[-1.8ex]
 Observations & 1,000 \\
 R$^{2}$ & 0.423 \\
\hline
\hline \\[-1.8ex]
\textit{Note:} & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
\end{tabular}
\end{table}
Stata Output
. eststo clear
. eststo: reg income education
(est1 stored)
. eststo: reg income education age
(est2 stored)
. esttab using "table.tex", replace se star(* 0.1 ** 0.05 *** 0.01)
(output written to table.tex)

Preview of table.tex:
------------------------------------------------------------
                      (1)             (2)
                   income          income
------------------------------------------------------------
education      3124.56***      2847.63***
                (148.23)        (142.58)
age                              156.24***
                                 (28.45)
_cons         12456.78***     15234.50***
               (1345.67)       (1205.32)
------------------------------------------------------------
N                    1000            1000
R-sq                0.385           0.423
------------------------------------------------------------
Standard errors in parentheses
* p<0.1, ** p<0.05, *** p<0.01
R Output
> library(modelsummary)
> model1 <- lm(income ~ education, data = df)
> model2 <- lm(income ~ education + age, data = df)
> modelsummary(list(model1, model2), stars = TRUE, output = "table.tex")
File 'table.tex' written successfully.

Preview:
|             |   Model 1   |   Model 2   |
|:------------|:-----------:|:-----------:|
| (Intercept) | 12456.78*** | 15234.50*** |
|             |  (1345.67)  |  (1205.32)  |
| education   |  3124.56*** |  2847.63*** |
|             |   (148.23)  |   (142.58)  |
| age         |             |   156.24*** |
|             |             |    (28.45)  |
| Num.Obs.    |     1000    |     1000    |
| R2          |    0.385    |    0.423    |
| R2 Adj.     |    0.384    |    0.422    |
| AIC         |   22512.3   |   22489.1   |
| BIC         |   22527.1   |   22508.8   |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
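
Descriptive statistics can be exported the same way. A minimal pandas sketch (to_latex is pandas' built-in exporter; the output path follows the project structure from Section 5.1):

# Python: Export a descriptive statistics table to LaTeX
desc = df[['income', 'education', 'age']].describe().round(2)
with open('output/tables/descriptives.tex', 'w') as f:
    f.write(desc.to_latex())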

5.8 Exercises

Exercise 5.1: Descriptive Statistics and Visualization

Practice computing statistics and creating visualizations. Complete these tasks:

  1. Calculate descriptive statistics (mean, median, std)
  2. Create a histogram
  3. Create a scatter plot