5 Data Analysis & Visualization
Learning Objectives
- Organize research projects with clear script naming conventions
- Create and maintain README files for reproducibility
- Compute and interpret descriptive statistics
- Create publication-quality visualizations
- Run and interpret regression analyses
- Generate summary tables for papers
5.1 Project Organization
Before diving into analysis, establishing a clear project structure is essential. A well-organized project allows you (and others) to understand what each script does and the order in which to run them. This becomes critical as projects grow in complexity and when sharing code for replication.
Script Naming Conventions
Script names should reveal what they do and the order to run them. The most effective approach uses numbered prefixes combined with descriptive names:
##_action_description.ext
where ## is a two-digit number (01, 02, 03...) indicating execution order.
Example Project Structure
project/
├── data/
│ ├── raw/ # Original, untouched data
│ │ ├── census_2020.csv
│ │ └── survey_responses.xlsx
│ ├── processed/ # Cleaned, analysis-ready data
│ │ └── analysis_sample.csv
│ └── temp/ # Intermediate files
│
├── code/
│ ├── 01_import_data.py # Load raw data
│ ├── 02_clean_data.py # Handle missing values, outliers
│ ├── 03_merge_datasets.py # Combine data sources
│ ├── 04_create_variables.py # Generate analysis variables
│ ├── 05_descriptive_stats.py # Summary statistics, EDA
│ ├── 06_regression_analysis.py # Main estimation
│ ├── 07_robustness_checks.py # Sensitivity analyses
│ └── 08_create_figures.py # Publication plots
│
├── output/
│ ├── tables/
│ └── figures/
│
└── README.md # Project documentation
Naming Best Practices
| Practice | Good | Avoid |
|---|---|---|
| Use numbers for order | 01_import.py | import.py |
| Be descriptive | 03_clean_income_vars.do | 03_clean.do |
| Use underscores | 05_merge_datasets.R | 05 merge datasets.R |
| Lowercase | 02_data_prep.py | 02_Data_Prep.py |
| Skip numbers for flexibility | 01, 02, 05, 10 | 1, 2, 3, 4 |
Leave gaps in your numbering (01, 02, 05, 10...) so you can insert new scripts without renaming everything. If you later need a script between 02_clean.py and 05_merge.py, you can create 03_validate.py without disrupting the sequence.
Language-Specific Conventions
# Python: Typical project structure
# code/
# 01_import_data.py
# 02_clean_data.py
# 03_analysis.py
# utils/ # Helper functions
# __init__.py
# data_helpers.py
# config.py # Settings, paths
# In config.py:
from pathlib import Path
PROJECT_ROOT = Path(__file__).parent.parent
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
OUTPUT = PROJECT_ROOT / "output"
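Python has no built-in equivalent of Stata's master do-file, but a short runner can execute the numbered scripts in order, since two-digit prefixes sort lexicographically. A minimal sketch (the `run_pipeline` name is illustrative, not a standard):

```python
import subprocess
import sys
from pathlib import Path

def run_pipeline(code_dir):
    """Run every numbered script in code_dir in lexicographic (= execution) order."""
    scripts = sorted(Path(code_dir).glob("[0-9][0-9]_*.py"))
    for script in scripts:
        print(f"Running {script.name} ...")
        # check=True stops the pipeline on the first failing script
        subprocess.run([sys.executable, str(script)], check=True)
    return [s.name for s in scripts]  # return the order for logging
```

Calling `run_pipeline("code")` from the project root runs 01, 02, 05, ... in sequence, which is why the numbering-with-gaps convention above pays off.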
* Stata: Master do-file pattern
* Create a 00_master.do that runs everything
* ============================================
* 00_master.do - Run entire analysis pipeline
* ============================================
global PROJECT "/Users/researcher/my_project"
global CODE "$PROJECT/code"
global DATA "$PROJECT/data"
global OUTPUT "$PROJECT/output"
* Run scripts in order
do "$CODE/01_import_data.do"
do "$CODE/02_clean_data.do"
do "$CODE/03_merge_datasets.do"
do "$CODE/04_analysis.do"
do "$CODE/05_create_tables.do"
# R: Typical project structure
# Using here() package for portable paths
# code/
# 00_main.R # Master script
# 01_load_data.R
# 02_clean_data.R
# 03_analysis.R
# R/ # Reusable functions
# utils.R
# In 00_main.R:
library(here)
# Source all scripts in order
source(here("code", "01_load_data.R"))
source(here("code", "02_clean_data.R"))
source(here("code", "03_analysis.R"))
5.2 README Files
A README file is the front door to your project. It tells collaborators (including your future self) what the project does, how to run it, and what to expect. Every research project should have one.
This section introduces README basics. For comprehensive documentation practices, including data dictionaries, codebooks, and replication packages, see Module 8: Replicability.
Essential README Elements
At minimum, your README should answer these questions:
- What is this project? - Brief description of the research question
- How do I run it? - Step-by-step instructions to reproduce results
- What do I need? - Software requirements, data sources
- What does each file do? - Description of key scripts
README Template
# Project Title
Brief description of your research project and main findings.
## Data
- **Source:** Describe where data comes from
- **Access:** How to obtain the data (public/restricted)
- **Files:** List main data files
## Requirements
### Software
- Python 3.9+ with packages in `requirements.txt`
- Stata 17 (for some analyses)
- R 4.2+ with packages in `renv.lock`
### Installation
```bash
pip install -r requirements.txt
```
## How to Replicate
1. Clone this repository
2. Download data from [source] and place in `data/raw/`
3. Run scripts in order:
- `01_import_data.py` - Load and validate raw data
- `02_clean_data.py` - Handle missing values, create variables
- `03_analysis.py` - Main regression analysis
- `04_figures.py` - Generate all figures
Or run everything with:
```bash
python 00_main.py
```
## File Structure
```
├── code/ # Analysis scripts
├── data/ # Data files (not tracked in git)
├── output/ # Tables and figures
└── README.md # This file
```
## Authors
Your Name (your.email@university.edu)
## License
MIT License (or appropriate license)
README for Different Audiences
| Audience | Key Information |
|---|---|
| Replicator | Exact steps to reproduce results, software versions, data access |
| Collaborator | Code organization, where to add new analyses, coding conventions |
| Future You | What you were thinking, why certain decisions were made |
| Reviewer/Editor | What data is included, what is sensitive, how results map to paper |
5.3 Descriptive Statistics
# Python: Descriptive statistics
import pandas as pd
df.describe() # Summary stats
df['income'].mean() # Mean
df['income'].median() # Median
df['income'].std() # Standard deviation
df['income'].quantile([0.25, 0.75]) # Percentiles
# By group
df.groupby('education')['income'].mean()
* Stata: Descriptive statistics
summarize income
summarize income, detail
tabstat income, by(education) stat(mean sd n)
# R: Descriptive statistics
summary(df$income)
mean(df$income, na.rm = TRUE)
sd(df$income, na.rm = TRUE)
# By group with dplyr
library(dplyr)
df %>% group_by(education) %>% summarize(mean_inc = mean(income, na.rm = TRUE))
5.4 Data Visualization
Histograms
# Python: Histogram
import matplotlib.pyplot as plt
plt.hist(df['income'], bins=30, edgecolor='white')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.title('Distribution of Income')
plt.savefig('histogram.png')
* Stata: Histogram
histogram income, frequency ///
title("Distribution of Income")
graph export "histogram.png", replace
# R: Histogram with ggplot2
library(ggplot2)
ggplot(df, aes(x = income)) +
geom_histogram(bins = 30, fill = "steelblue") +
labs(title = "Distribution of Income")
ggsave("histogram.png")
Scatter Plots
# Python: Scatter plot
import seaborn as sns
sns.scatterplot(data=df, x='gdpPercap', y='lifeExp', hue='continent')
plt.xscale('log')
plt.savefig('scatter.png')
* Stata: Scatter plot
twoway (scatter lifeExp gdpPercap), xscale(log)
# R: Scatter plot
ggplot(df, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
scale_x_log10()
5.5 Correlation Analysis
# Python: Correlation
df[['income', 'education', 'age']].corr()
# With p-values
from scipy import stats
r, p = stats.pearsonr(df['income'], df['education'])
* Stata: Correlation
correlate income education age
pwcorr income education age, sig star(.05)
# R: Correlation
cor(df[c("income", "education", "age")], use = "complete.obs")
cor.test(df$income, df$education)
5.6 Basic Regression Analysis
Regression analysis is the foundation of quantitative empirical research. This section covers how to run regressions and—critically—how to interpret what the output means. Understanding these basics is essential before moving to causal inference methods in Module 6.
Running a Basic Regression
# Python: OLS Regression with statsmodels
import statsmodels.formula.api as smf
# Simple regression: income on education
model = smf.ols('income ~ education', data=df).fit()
# Multiple regression: add controls
model = smf.ols('income ~ education + age + experience', data=df).fit()
# View results
print(model.summary())
* Stata: OLS Regression
* Simple regression
reg income education
* Multiple regression with controls
reg income education age experience
* Robust standard errors (heteroskedasticity-robust)
reg income education age experience, robust
* Clustered standard errors
reg income education age experience, cluster(state)
# R: OLS Regression
# Simple regression
model <- lm(income ~ education, data = df)
# Multiple regression with controls
model <- lm(income ~ education + age + experience, data = df)
# View results
summary(model)
# Robust standard errors (using sandwich package)
library(sandwich)
library(lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
Understanding Regression Output
A regression output contains many statistics. Here's what each one means and how to interpret it:
```
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    15234.50    1205.32     12.639      0.000    12869.45    17599.55
education     2847.63     142.58     19.972      0.000     2567.83     3127.43
age            156.24      28.45      5.491      0.000      100.39      212.09
==============================================================================
R-squared:           0.423
F-statistic:         285.7
Prob (F-statistic):  1.23e-89
==============================================================================
```
Key Components Explained
| Component | What It Tells You | Example Interpretation |
|---|---|---|
| Coefficient (coef) | The estimated effect of a one-unit change in X on Y | One more year of education is associated with $2,847.63 higher income |
| Standard Error (std err) | Uncertainty in the coefficient estimate | Our estimate of education's effect could vary by about $142.58 |
| t-statistic (t) | Coefficient divided by standard error; tests if coef differs from 0 | 19.97 is very large, indicating strong evidence of an effect |
| P-value (P>|t|) | Probability of seeing this result if true effect were zero | 0.000 means extremely unlikely to see this by chance |
| Confidence Interval | Range of plausible values; across repeated samples, 95% of such intervals contain the true coefficient | The education effect is plausibly between $2,568 and $3,127 |
| R-squared | Proportion of variance in Y explained by the model | Education and age explain 42.3% of income variation |
| F-statistic | Tests if all coefficients together are zero | High F (285.7) with low p-value means model explains something |
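Every statistic in this table is also available programmatically from a fitted statsmodels results object, which is useful when building custom tables. A sketch using simulated data with a known education effect (the variable names and simulated values are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate data where education truly raises income by ~2,800 per year
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({"education": rng.integers(8, 21, n),
                   "age": rng.integers(22, 65, n)})
df["income"] = 15000 + 2800 * df["education"] + 150 * df["age"] + rng.normal(0, 8000, n)

model = smf.ols("income ~ education + age", data=df).fit()

coefs = model.params     # point estimates
ses = model.bse          # standard errors
tvals = model.tvalues    # t-statistics (coef / std err)
pvals = model.pvalues    # two-sided p-values
ci = model.conf_int()    # 95% confidence intervals (columns 0 and 1)
r2 = model.rsquared      # proportion of variance explained
fstat = model.fvalue     # joint F-statistic
```

Note that `tvals` equals `coefs / ses` element by element, which is exactly the relationship described in the table above.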
A regression coefficient measures association, not causation. The education coefficient tells us that higher education is associated with higher income—not that education causes higher income. People who get more education may differ in other ways (ability, motivation, family background) that also affect income. To make causal claims, you need the methods covered in Module 6: Causal Inference.
Statistical Significance
A coefficient is "statistically significant" when we can be confident it differs from zero. Common thresholds:
- p < 0.01 (***): Very strong evidence
- p < 0.05 (**): Strong evidence (most common threshold)
- p < 0.10 (*): Weak evidence
However, statistical significance tells you nothing about practical significance. A coefficient can be statistically significant but economically tiny—or statistically insignificant but potentially meaningful with more data.
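The conventional star notation can be generated with a small helper that applies the thresholds above (the function name is illustrative; these are also the cutoffs used by esttab's star() option in Section 5.7, though conventions vary across journals):

```python
def significance_stars(p):
    """Map a p-value to conventional significance stars."""
    if p < 0.01:
        return "***"   # very strong evidence
    if p < 0.05:
        return "**"    # strong evidence
    if p < 0.10:
        return "*"     # weak evidence
    return ""          # not significant at conventional levels
```

For example, `significance_stars(0.003)` returns `"***"` while `significance_stars(0.07)` returns `"*"`.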
Common Regression Specifications
# Python: Common specifications
import statsmodels.formula.api as smf
import numpy as np
# Log transformation
df['log_income'] = np.log(df['income'])
model = smf.ols('log_income ~ education + age', data=df).fit()
# Quadratic term (non-linear relationship)
model = smf.ols('income ~ age + I(age**2)', data=df).fit()
# Interaction terms
model = smf.ols('income ~ education * gender', data=df).fit()
# Categorical variables (dummy encoding)
model = smf.ols('income ~ education + C(region)', data=df).fit()
* Stata: Common specifications
* Log transformation
gen log_income = ln(income)
reg log_income education age
* Quadratic term
reg income c.age##c.age
* Interaction terms
reg income c.education##i.female
* Categorical variables
reg income education i.region
* Fixed effects
areg income education age, absorb(state)
# R: Common specifications
# Log transformation
model <- lm(log(income) ~ education + age, data = df)
# Quadratic term
model <- lm(income ~ age + I(age^2), data = df)
# Interaction terms
model <- lm(income ~ education * gender, data = df)
# Categorical variables
model <- lm(income ~ education + factor(region), data = df)
# Fixed effects (using fixest package)
library(fixest)
model <- feols(income ~ education + age | state, data = df)
Interpreting Log Transformations
| Model | Form | Interpretation of b1 |
|---|---|---|
| Linear | Y = b0 + b1*X | 1 unit increase in X → b1 unit change in Y |
| Log-Linear | log(Y) = b0 + b1*X | 1 unit increase in X → approximately (b1 × 100)% change in Y (accurate only for small b1) |
| Linear-Log | Y = b0 + b1*log(X) | 1% increase in X → b1/100 unit change in Y |
| Log-Log | log(Y) = b0 + b1*log(X) | 1% increase in X → b1% change in Y (elasticity) |
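The log-linear percentage interpretation is an approximation: the exact percentage change is (e^b1 − 1) × 100, which diverges from b1 × 100 as the coefficient grows. A quick numeric check:

```python
import numpy as np

b1_small, b1_large = 0.05, 0.50

# Approximate % change: b1 * 100
approx_small = b1_small * 100   # 5%
approx_large = b1_large * 100   # 50%

# Exact % change: (e^b1 - 1) * 100
exact_small = (np.exp(b1_small) - 1) * 100   # ~5.13%, close to the approximation
exact_large = (np.exp(b1_large) - 1) * 100   # ~64.87%, far from 50%
```

For coefficients below about 0.1 the shortcut is safe; for larger coefficients, report the exact figure.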
5.7 Summary Tables for Papers
# Python: Summary table with stargazer
# pip install stargazer
from stargazer.stargazer import Stargazer
# For regression tables
import statsmodels.formula.api as smf
model = smf.ols('income ~ education + age', data=df).fit()
stargazer = Stargazer([model])
print(stargazer.render_latex())
* Stata: Summary tables with estout
eststo clear
eststo: reg income education
eststo: reg income education age
esttab using "table.tex", replace se star(* 0.1 ** 0.05 *** 0.01)
# R: Summary tables with modelsummary
library(modelsummary)
model1 <- lm(income ~ education, data = df)
model2 <- lm(income ~ education + age, data = df)
modelsummary(list(model1, model2), stars = TRUE, output = "table.tex")
5.8 Exercises
Exercise 5.1: Descriptive Statistics and Visualization
Practice computing statistics and creating visualizations. Complete these tasks:
- Calculate descriptive statistics (mean, median, std)
- Create a histogram
- Create a scatter plot
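A starter sketch for the exercise, using simulated stand-in data (replace the simulated columns with your own dataset; file names are illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Simulated stand-in data: right-skewed income, years of education
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(10, 0.5, 300),
                   "education": rng.integers(8, 21, 300)})

# 1. Descriptive statistics
stats = df["income"].agg(["mean", "median", "std"])
print(stats)

# 2. Histogram
fig, ax = plt.subplots()
ax.hist(df["income"], bins=30, edgecolor="white")
ax.set_xlabel("Income")
ax.set_ylabel("Frequency")
fig.savefig("histogram.png")

# 3. Scatter plot
fig, ax = plt.subplots()
ax.scatter(df["education"], df["income"], alpha=0.5)
ax.set_xlabel("Education (years)")
ax.set_ylabel("Income")
fig.savefig("scatter.png")
```

For skewed variables like income, compare the mean and median: a mean well above the median signals right skew, which the histogram should confirm.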