3 Data Exploration
Learning Objectives
- Load data from CSV, Excel, and Stata formats
- Understand dataset structure before any analysis
- Build a comprehensive "First Analysis" script
- Explore distributions, ranges, and correlations
I witnessed many students—(and occasionally also senior researchers...)—make a costly mistake: jumping straight into analysis without first getting to know their data.
The consequences: running regressions on variables with 80% missing values, drawing conclusions from data entry errors, missing obvious patterns, etc. In this module I'll share a more disciplined approach.
The Data Exploration Workflow
Import data
Structure & types
Distributions
Patterns
Relationships
Step 1: Loading Data
What is "Loading Data"?
Loading data means reading information from a file into your program's memory. The file stays unchanged—you create a working copy. Different formats (CSV, Excel, Stata) require different commands.
For our research project, we'll use World Bank development indicators. Click Run to see the output for each language:
# COMPLETE EXAMPLE: Load World Bank data directly from URL
# This demonstrates loading CSV data from the internet
# Import pandas - the main data manipulation library in Python
import pandas as pd
# World Bank provides CSV downloads - we use GDP data
# This URL points to a publicly available dataset on GitHub
url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"
# read_csv() loads CSV data into a DataFrame
# It automatically detects column names and data types
df = pd.read_csv(url)
# Verify the load worked by checking dimensions
# len(df) returns number of rows, len(df.columns) returns number of columns
print(f"Loaded {len(df):,} rows and {len(df.columns)} columns")
print(f"Columns: {list(df.columns)}")
* COMPLETE EXAMPLE: Load World Bank data
* Stata can import CSV files directly from URLs
* Clear any existing data from memory
clear all
* Import CSV file from URL into Stata
* 'delimited' means comma-separated values
* 'clear' option replaces any data in memory
import delimited "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv", clear
* Verify the load - describe shows variable info
* 'short' option gives a compact summary
describe, short
* List first 5 observations to preview the data
list in 1/5
# COMPLETE EXAMPLE: Load World Bank data directly from URL
# R can read CSV files from URLs natively
# Load tidyverse - a collection of R packages for data science
# Includes readr (for reading data), dplyr (for manipulation), ggplot2 (for visualization)
library(tidyverse)
# Define the URL pointing to our CSV data
url <- "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"
# read_csv() from readr package loads the data into a tibble
# A tibble is a modern version of R's data.frame
df <- read_csv(url)
# Print dimensions and column names
# nrow() = number of rows, ncol() = number of columns
cat(sprintf("Loaded %s rows and %d columns\n",
format(nrow(df), big.mark=","), ncol(df)))
print(names(df))
Step 2: Inspecting Structure
Before analyzing data, we need to understand its structure: what variables exist, their types, and any missing values. Knowing the data structure will prevent many mistakes!
# Continuing from above - inspect the data structure
# This is crucial before any analysis!
# info() shows data types and non-null counts for each column
# This reveals missing values and helps identify type issues
print("=== DATA STRUCTURE ===")
print(df.info())
# head() shows first rows - always visually inspect!
# Default is 5 rows, but you can specify: head(10)
print("\n=== FIRST 5 ROWS ===")
print(df.head())
* Continuing - inspect structure
* Always check your data before analysis!
* Full variable description including type, format, labels
describe
* First observations - visual inspection is crucial
list in 1/5
# Continuing - inspect structure
# Understanding structure prevents analysis mistakes
# str() shows structure: variable types, sample values
# This is R's equivalent of pandas info()
str(df)
# head() shows first 6 rows by default
head(df)
Step 3: Summary Statistics
Now we calculate key statistics to understand our data's distribution and identify potential issues.
# Summary statistics - understand your distributions
# describe() gives count, mean, std, min, 25%, 50%, 75%, max
# Only works on numeric columns by default
print("=== SUMMARY STATISTICS ===")
print(df.describe())
# Missing values analysis - CRITICAL for research!
# isnull() returns True/False, sum() counts True values
print("\n=== MISSING VALUES ===")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
print(pd.DataFrame({'Missing': missing, 'Percent': missing_pct}))
* Summary statistics
* 'detail' option gives more info: percentiles, skewness, kurtosis
summarize, detail
* Missing values analysis
* misstable shows patterns of missing data
misstable summarize
# Summary statistics
# summary() gives min, 1st quartile, median, mean, 3rd quartile, max
summary(df)
# Missing values count per column
# is.na() returns TRUE/FALSE, colSums() sums each column
cat("\nMissing values:\n")
colSums(is.na(df))
25% of GDP values are missing. Before any analysis, we need to understand why. Are they missing for small countries? Recent years? This is exactly what the "First Analysis" approach catches early.
Step 4-5: Visualization & Correlations
Visual inspection reveals patterns that summary statistics may miss. I personally like to have code produce figures files rather than just displaying them inline. This also makes them easy to include in reports for coauthors, as well as later in papers and presentations. As usual, go with your mouse over the code below to inspect the commands that save the files (the code underlined with dashed line).
# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os
# Create output directory for figures
os.makedirs('figures', exist_ok=True)
# Prepare data
df_viz = df.copy()
df_viz['GDP_billions'] = df_viz['Value'] / 1e9
df_viz['log_GDP'] = np.log10(df_viz['Value'].replace(0, np.nan))
df_viz['Decade'] = (df_viz['Year'] // 10) * 10
# ─────────────────────────────────────────────────────────────
# FIGURE 1: Distribution of GDP (histogram + density)
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(df_viz['log_GDP'].dropna(), kde=True, bins=40, ax=ax)
ax.set_xlabel('Log10(GDP in USD)')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of GDP (log scale)')
fig.savefig('figures/01_gdp_distribution.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/01_gdp_distribution.png")
# ─────────────────────────────────────────────────────────────
# FIGURE 2: Box plot by decade
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(data=df_viz, x='Decade', y='log_GDP', ax=ax)
ax.set_ylabel('Log10(GDP)')
ax.set_title('GDP Distribution by Decade')
ax.tick_params(axis='x', rotation=45)
fig.savefig('figures/02_gdp_boxplot_decades.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/02_gdp_boxplot_decades.png")
# ─────────────────────────────────────────────────────────────
# FIGURE 3: Time series for major economies
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 6))
countries = ['USA', 'CHN', 'DEU', 'BRA', 'IND']
for country in countries:
data = df_viz[df_viz['Country Code'] == country]
ax.plot(data['Year'], data['GDP_billions'], label=country, linewidth=2)
ax.set_xlabel('Year')
ax.set_ylabel('GDP (Billion USD)')
ax.set_title('GDP Trends: Major Economies')
ax.legend()
ax.grid(True, alpha=0.3)
fig.savefig('figures/03_gdp_trends.png', dpi=300, bbox_inches='tight')
fig.savefig('figures/03_gdp_trends.pdf', bbox_inches='tight')
plt.close()
print("Saved: figures/03_gdp_trends.png and .pdf")
# ─────────────────────────────────────────────────────────────
# FIGURE 4: Scatter plot - all countries over time
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 6))
sns.scatterplot(data=df_viz.dropna(), x='Year', y='log_GDP', alpha=0.3, s=15, ax=ax)
ax.set_ylabel('Log10(GDP)')
ax.set_title('All Countries: GDP Over Time')
fig.savefig('figures/04_gdp_scatter.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/04_gdp_scatter.png")
# ─────────────────────────────────────────────────────────────
# FIGURE 5: Density comparison across decades
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(8, 5))
for decade in [1980, 2000, 2020]:
data = df_viz[(df_viz['Decade'] == decade) & (df_viz['log_GDP'].notna())]
sns.kdeplot(data['log_GDP'], label=f'{decade}s', linewidth=2, ax=ax)
ax.set_xlabel('Log10(GDP)')
ax.set_ylabel('Density')
ax.set_title('GDP Distribution Shift Over Decades')
ax.legend(title='Decade')
fig.savefig('figures/05_gdp_density_decades.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/05_gdp_density_decades.png")
# List all generated figures
print("\n─── All figures saved to ./figures/ ───")
for f in sorted(os.listdir('figures')):
print(f" {f}")
* Create output directory for figures
capture mkdir "figures"
* Prepare data
gen gdp_billions = value / 1e9
gen log_gdp = log10(value)
gen decade = int(year/10)*10
* ─────────────────────────────────────────────────────────────
* FIGURE 1: Distribution of GDP (histogram + density)
* ─────────────────────────────────────────────────────────────
histogram log_gdp, kdensity frequency ///
title("Distribution of GDP (log scale)") ///
xtitle("Log10(GDP in USD)")
graph export "figures/01_gdp_distribution.png", replace
* ─────────────────────────────────────────────────────────────
* FIGURE 2: Box plot by decade
* ─────────────────────────────────────────────────────────────
graph box log_gdp, over(decade) ///
title("GDP Distribution by Decade") ///
ytitle("Log10(GDP)")
graph export "figures/02_gdp_boxplot_decades.png", replace
* ─────────────────────────────────────────────────────────────
* FIGURE 3: Time series for major economies
* ─────────────────────────────────────────────────────────────
twoway (line gdp_billions year if countrycode == "USA", lwidth(medium)) ///
(line gdp_billions year if countrycode == "CHN", lwidth(medium)) ///
(line gdp_billions year if countrycode == "DEU", lwidth(medium)) ///
(line gdp_billions year if countrycode == "BRA", lwidth(medium)) ///
(line gdp_billions year if countrycode == "IND", lwidth(medium)), ///
legend(label(1 "USA") label(2 "China") label(3 "Germany") ///
label(4 "Brazil") label(5 "India")) ///
title("GDP Trends: Major Economies") ///
ytitle("GDP (Billion USD)") xtitle("Year")
graph export "figures/03_gdp_trends.png", replace
graph export "figures/03_gdp_trends.pdf", replace
* ─────────────────────────────────────────────────────────────
* FIGURE 4: Scatter plot - all countries over time
* ─────────────────────────────────────────────────────────────
scatter log_gdp year, msize(tiny) mcolor(%30) ///
title("All Countries: GDP Over Time") ///
ytitle("Log10(GDP)") xtitle("Year")
graph export "figures/04_gdp_scatter.png", replace
* ─────────────────────────────────────────────────────────────
* FIGURE 5: Density comparison across decades
* ─────────────────────────────────────────────────────────────
twoway (kdensity log_gdp if decade == 1980, lwidth(medium)) ///
(kdensity log_gdp if decade == 2000, lwidth(medium)) ///
(kdensity log_gdp if decade == 2020, lwidth(medium)), ///
legend(label(1 "1980s") label(2 "2000s") label(3 "2020s")) ///
title("GDP Distribution Shift Over Decades") ///
xtitle("Log10(GDP)") ytitle("Density")
graph export "figures/05_gdp_density_decades.png", replace
* List all generated figures
dir "figures/"
# Load required libraries
library(tidyverse)
# Create output directory for figures
dir.create("figures", showWarnings = FALSE)
# Prepare data
df_viz <- df %>%
mutate(
GDP_billions = Value / 1e9,
log_GDP = log10(Value),
Decade = (Year %/% 10) * 10
)
# ─────────────────────────────────────────────────────────────
# FIGURE 1: Distribution of GDP (histogram + density)
# ─────────────────────────────────────────────────────────────
p1 <- ggplot(df_viz, aes(x = log_GDP)) +
geom_histogram(aes(y = after_stat(density)), bins = 40,
fill = "steelblue", alpha = 0.7) +
geom_density(color = "darkred", linewidth = 1) +
labs(x = "Log10(GDP in USD)", y = "Density",
title = "Distribution of GDP (log scale)")
ggsave("figures/01_gdp_distribution.png", p1, width = 8, height = 5, dpi = 300)
cat("Saved: figures/01_gdp_distribution.png\n")
# ─────────────────────────────────────────────────────────────
# FIGURE 2: Box plot by decade
# ─────────────────────────────────────────────────────────────
p2 <- df_viz %>%
filter(!is.na(log_GDP)) %>%
ggplot(aes(x = factor(Decade), y = log_GDP)) +
geom_boxplot(fill = "lightblue", outlier.alpha = 0.3) +
labs(x = "Decade", y = "Log10(GDP)",
title = "GDP Distribution by Decade")
ggsave("figures/02_gdp_boxplot_decades.png", p2, width = 10, height = 5, dpi = 300)
cat("Saved: figures/02_gdp_boxplot_decades.png\n")
# ─────────────────────────────────────────────────────────────
# FIGURE 3: Time series for major economies
# ─────────────────────────────────────────────────────────────
countries <- c("USA", "CHN", "DEU", "BRA", "IND")
p3 <- df_viz %>%
filter(`Country Code` %in% countries) %>%
ggplot(aes(x = Year, y = GDP_billions, color = `Country Code`)) +
geom_line(linewidth = 1.2) +
labs(y = "GDP (Billion USD)", title = "GDP Trends: Major Economies") +
theme(legend.position = "bottom")
ggsave("figures/03_gdp_trends.png", p3, width = 10, height = 6, dpi = 300)
ggsave("figures/03_gdp_trends.pdf", p3, width = 10, height = 6)
cat("Saved: figures/03_gdp_trends.png and .pdf\n")
# ─────────────────────────────────────────────────────────────
# FIGURE 4: Scatter plot - all countries over time
# ─────────────────────────────────────────────────────────────
p4 <- df_viz %>%
filter(!is.na(log_GDP)) %>%
ggplot(aes(x = Year, y = log_GDP)) +
geom_point(alpha = 0.2, size = 0.8) +
labs(y = "Log10(GDP)", title = "All Countries: GDP Over Time")
ggsave("figures/04_gdp_scatter.png", p4, width = 10, height = 6, dpi = 300)
cat("Saved: figures/04_gdp_scatter.png\n")
# ─────────────────────────────────────────────────────────────
# FIGURE 5: Density comparison across decades
# ─────────────────────────────────────────────────────────────
p5 <- df_viz %>%
filter(Decade %in% c(1980, 2000, 2020), !is.na(log_GDP)) %>%
ggplot(aes(x = log_GDP, color = factor(Decade))) +
geom_density(linewidth = 1.2) +
labs(x = "Log10(GDP)", y = "Density", color = "Decade",
title = "GDP Distribution Shift Over Decades")
ggsave("figures/05_gdp_density_decades.png", p5, width = 8, height = 5, dpi = 300)
cat("Saved: figures/05_gdp_density_decades.png\n")
# List all generated figures
cat("\n─── All figures saved to ./figures/ ───\n")
list.files("figures")
Saved figure files:
Saved figure files:
Saved figure files:
Complete "First Analysis" Script
Use this script as guideline to propduce one you can run on every new dataset before any analysis:
Click to expand complete Python script
# ============================================
# FIRST ANALYSIS SCRIPT - You can run something like this on every new dataset!
# ============================================
import pandas as pd
import matplotlib.pyplot as plt
# 1. LOAD DATA - Replace URL with your data source
url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"
df = pd.read_csv(url)
print("="*50)
print("FIRST ANALYSIS REPORT")
print("="*50)
# 2. STRUCTURE - Understand dimensions and types
print(f"\nRows: {len(df):,} | Columns: {len(df.columns)}")
print(f"\nColumn types:\n{df.dtypes}")
# 3. MISSING VALUES - Critical for research quality
print("\n--- MISSING VALUES ---")
missing_pct = (df.isnull().sum() / len(df) * 100).round(1)
print(missing_pct[missing_pct > 0])
# 4. SUMMARY STATISTICS - Understand distributions
print("\n--- SUMMARY STATISTICS ---")
print(df.describe())
# 5. HISTOGRAMS - Visual inspection of all numeric columns
df.select_dtypes(include=['number']).hist(figsize=(12, 6), bins=30)
plt.tight_layout()
plt.savefig('first_analysis_histograms.png')
plt.show()
print("\n" + "="*50)
print("ANALYSIS COMPLETE")
print("="*50)