3 Data Exploration

~4 hours Loading, Understanding, Exploring Beginner-Intermediate

Learning Objectives

Load data from CSV, Excel, and Stata formats
Understand dataset structure before any analysis
Build a comprehensive "First Analysis" script
Explore distributions, ranges, and correlations

Course Research Project: Climate Policy & Economic Development

Throughout this course, we'll build a complete research project worthy of publication. Our question:

"How does climate vulnerability affect economic growth across countries, and do international climate agreements moderate this relationship?"

This project uses real World Bank data, involves API calls, text analysis of policy documents, and causal inference methods. Each module advances the project.

A Critical Mistake I Often see

I witnessed many students—(and occasionally also senior researchers...)—make a costly mistake: jumping straight into analysis without first getting to know their data.

The consequences: running regressions on variables with 80% missing values, drawing conclusions from data entry errors, missing obvious patterns, etc. In this module I'll share a more disciplined approach.

The Data Exploration Workflow

Load

Import data

→

Inspect

Structure & types

→

Summarize

Distributions

→

Visualize

Patterns

→

Correlate

Relationships

Step 1: Loading Data

What is "Loading Data"?

Loading data means reading information from a file into your program's memory. The file stays unchanged—you create a working copy. Different formats (CSV, Excel, Stata) require different commands.

For our research project, we'll use World Bank development indicators. Click Run to see the output for each language:

# COMPLETE EXAMPLE: Load World Bank data directly from URL
# This demonstrates loading CSV data from the internet

# Import pandas - the main data manipulation library in Python
import pandas as pd

# World Bank provides CSV downloads - we use GDP data
# This URL points to a publicly available dataset on GitHub
url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"

# read_csv() loads CSV data into a DataFrame
# It automatically detects column names and data types
df = pd.read_csv(url)

# Verify the load worked by checking dimensions
# len(df) returns number of rows, len(df.columns) returns number of columns
print(f"Loaded {len(df):,} rows and {len(df.columns)} columns")
print(f"Columns: {list(df.columns)}")

* COMPLETE EXAMPLE: Load World Bank data
* Stata can import CSV files directly from URLs

* Clear any existing data from memory
clear all

* Import CSV file from URL into Stata
* 'delimited' means comma-separated values
* 'clear' option replaces any data in memory
import delimited "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv", clear

* Verify the load - describe shows variable info
* 'short' option gives a compact summary
describe, short

* List first 5 observations to preview the data
list in 1/5

# COMPLETE EXAMPLE: Load World Bank data directly from URL
# R can read CSV files from URLs natively

# Load tidyverse - a collection of R packages for data science
# Includes readr (for reading data), dplyr (for manipulation), ggplot2 (for visualization)
library(tidyverse)

# Define the URL pointing to our CSV data
url <- "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"

# read_csv() from readr package loads the data into a tibble
# A tibble is a modern version of R's data.frame
df <- read_csv(url)

# Print dimensions and column names
# nrow() = number of rows, ncol() = number of columns
cat(sprintf("Loaded %s rows and %d columns\n",
    format(nrow(df), big.mark=","), ncol(df)))
print(names(df))

Python Output (VS Code)

Loaded 14,112 rows and 4 columns Columns: ['Country Name', 'Country Code', 'Year', 'Value']

Stata Output

. import delimited "https://...gdp.csv", clear (encoding automatically selected: UTF-8) (4 vars, 14,112 obs) . describe, short Contains data Observations: 14,112 Variables: 4 . list in 1/5 +--------------------------------------------------+ | countryname countrycode year value | |--------------------------------------------------| 1. | Arab World ARB 1968 2.5761e+10 | 2. | Arab World ARB 1969 2.8434e+10 | 3. | Arab World ARB 1970 3.1386e+10 | 4. | Arab World ARB 1971 3.5875e+10 | 5. | Arab World ARB 1972 4.2509e+10 | +--------------------------------------------------+

R Output (RStudio Console)

Rows: 14112 Columns: 4 ── Column specification ────────────────────────────── Delimiter: "," chr (2): Country Name, Country Code dbl (2): Year, Value Loaded 14,112 rows and 4 columns [1] "Country Name" "Country Code" "Year" "Value"

Step 2: Inspecting Structure

Before analyzing data, we need to understand its structure: what variables exist, their types, and any missing values. Knowing the data structure will prevent many mistakes!

# Continuing from above - inspect the data structure
# This is crucial before any analysis!

# info() shows data types and non-null counts for each column
# This reveals missing values and helps identify type issues
print("=== DATA STRUCTURE ===")
print(df.info())

# head() shows first rows - always visually inspect!
# Default is 5 rows, but you can specify: head(10)
print("\n=== FIRST 5 ROWS ===")
print(df.head())

* Continuing - inspect structure
* Always check your data before analysis!

* Full variable description including type, format, labels
describe

* First observations - visual inspection is crucial
list in 1/5

# Continuing - inspect structure
# Understanding structure prevents analysis mistakes

# str() shows structure: variable types, sample values
# This is R's equivalent of pandas info()
str(df)

# head() shows first 6 rows by default
head(df)

Python Output

=== DATA STRUCTURE === <class 'pandas.core.frame.DataFrame'> RangeIndex: 14112 entries, 0 to 14111 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country Name 14112 non-null object 1 Country Code 14112 non-null object 2 Year 14112 non-null int64 3 Value 10585 non-null float64 === FIRST 5 ROWS === Country Name Country Code Year Value 0 Arab World ARB 1968 2.576068e+10 1 Arab World ARB 1969 2.843420e+10 2 Arab World ARB 1970 3.138565e+10 3 Arab World ARB 1971 3.587548e+10 4 Arab World ARB 1972 4.250944e+10

Stata Output

. describe Contains data Observations: 14,112 Variables: 4 Variable Storage Display Value name type format label ─────────────────────────────────────────── countryname str52 %52s countrycode str3 %9s year int %10.0g value double %10.0g . list in 1/5 +--------------------------------------------------+ | countryname countrycode year value | |--------------------------------------------------| 1. | Arab World ARB 1968 2.5761e+10 | 2. | Arab World ARB 1969 2.8434e+10 | 3. | Arab World ARB 1970 3.1386e+10 | 4. | Arab World ARB 1971 3.5875e+10 | 5. | Arab World ARB 1972 4.2509e+10 | +--------------------------------------------------+

R Output (RStudio Console)

> str(df) tibble [14,112 x 4] (S3: tbl_df/tbl/data.frame) $ Country Name: chr "Arab World" "Arab World" "Arab World" ... $ Country Code: chr "ARB" "ARB" "ARB" "ARB" ... $ Year : num 1968 1969 1970 1971 1972 ... $ Value : num 2.58e+10 2.84e+10 3.14e+10 3.59e+10 4.25e+10 ... > head(df) # A tibble: 6 x 4 `Country Name` `Country Code` Year Value <chr> <chr> <dbl> <dbl> 1 Arab World ARB 1968 25760680000 2 Arab World ARB 1969 28434200000 3 Arab World ARB 1970 31385650000 4 Arab World ARB 1971 35875480000 5 Arab World ARB 1972 42509440000 6 Arab World ARB 1973 54832410000

Step 3: Summary Statistics

Now we calculate key statistics to understand our data's distribution and identify potential issues.

# Summary statistics - understand your distributions

# describe() gives count, mean, std, min, 25%, 50%, 75%, max
# Only works on numeric columns by default
print("=== SUMMARY STATISTICS ===")
print(df.describe())

# Missing values analysis - CRITICAL for research!
# isnull() returns True/False, sum() counts True values
print("\n=== MISSING VALUES ===")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
print(pd.DataFrame({'Missing': missing, 'Percent': missing_pct}))

* Summary statistics
* 'detail' option gives more info: percentiles, skewness, kurtosis
summarize, detail

* Missing values analysis
* misstable shows patterns of missing data
misstable summarize

# Summary statistics
# summary() gives min, 1st quartile, median, mean, 3rd quartile, max
summary(df)

# Missing values count per column
# is.na() returns TRUE/FALSE, colSums() sums each column
cat("\nMissing values:\n")
colSums(is.na(df))

Python Output

=== SUMMARY STATISTICS === Year Value count 14112.00 10585.00 mean 1993.50 3.21e+11 std 15.59 1.42e+12 min 1960.00 2.69e+06 25% 1976.75 2.23e+09 50% 1993.50 1.45e+10 75% 2010.25 9.89e+10 max 2027.00 2.55e+13 === MISSING VALUES === Missing Percent Country Name 0 0.0 Country Code 0 0.0 Year 0 0.0 Value 3527 25.0

Stata Output

. summarize, detail year ───────────────────────────────────────────────────── Percentiles Smallest 1% 1968 1960 5% 1970 1960 10% 1972 1960 Obs 14,112 25% 1977 1960 Sum of wgt. 14,112 50% 1994 Mean 1993.5 Largest Std. dev. 15.589 75% 2010 2027 90% 2016 2027 Variance 243.02 95% 2019 2027 Skewness 0 99% 2024 2027 Kurtosis 1.8 . misstable summarize Obs<. ───────── Obs. Variable Unique missing name values Min Max ───────────────────────────────────────────────── 3,527 value 10585 . .

R Output (RStudio Console)

> summary(df) Country Name Country Code Year Value Length:14112 Length:14112 Min. :1960 Min. :2.69e+06 Class :character Class :character 1st Qu.:1977 1st Qu.:2.23e+09 Mode :character Mode :character Median :1994 Median :1.45e+10 Mean :1994 Mean :3.21e+11 3rd Qu.:2010 3rd Qu.:9.89e+10 Max. :2027 Max. :2.55e+13 NA's :3527 Missing values: Country Name Country Code Year Value 0 0 0 3527

Red Flag Detected!

25% of GDP values are missing. Before any analysis, we need to understand why. Are they missing for small countries? Recent years? This is exactly what the "First Analysis" approach catches early.

Cross-Tabulations and Group-Level Statistics

We found 25% of GDP values are missing—but where are they concentrated? Cross-tabulations let us investigate patterns across groups (here, across decades) and compute group-level summaries. In other contexts, the same tools let you compare treated vs. control groups, callback rates by race, or outcomes across regions. These are essential for descriptive analysis before running any regression.

# Python: Cross-tabulations and group statistics
# Continuing from above - df is our World Bank GDP data

# Create useful grouping variables from existing data
df['Decade'] = (df['Year'] // 10) * 10
df['has_gdp'] = df['Value'].notna().astype(int)

# Two-way frequency table: data availability by decade
freq_table = pd.crosstab(df['Decade'], df['has_gdp'], margins=True)
print(freq_table)

# Row proportions (share of countries with GDP data per decade)
prop_table = pd.crosstab(df['Decade'], df['has_gdp'], normalize='index')
print(prop_table)

# Group-level summary statistics: GDP by decade
group_stats = df.groupby('Decade')['Value'].agg(
    ['mean', 'std', 'count']
)
print(group_stats)

* Stata: Cross-tabulations and group statistics
* Continuing from above - GDP data is already loaded

* Create useful grouping variables
gen decade = int(year/10)*10
gen has_gdp = !missing(value)

* Two-way frequency table: data availability by decade
tab decade has_gdp

* With row and column percentages
tab decade has_gdp, row col

* Group-level summary statistics: GDP by decade
tabstat value, by(decade) stat(mean sd count)

* Detailed summary by group
bysort decade: summarize value, detail

* Table with chi-squared test
tab decade has_gdp, chi2

# R: Cross-tabulations and group statistics
# Continuing from above - df is our World Bank GDP data

# Create useful grouping variables
df$Decade  <- (df$Year %/% 10) * 10
df$has_gdp <- as.integer(!is.na(df$Value))

# Two-way frequency table: data availability by decade
freq_table <- table(df$Decade, df$has_gdp)
addmargins(freq_table)

# Row proportions (share with GDP data per decade)
prop.table(table(df$Decade, df$has_gdp), margin = 1)

# Group means using aggregate
aggregate(Value ~ Decade, data = df, FUN = mean)

# Tidyverse approach: group_by + summarise
df %>%
  group_by(Decade) %>%
  summarise(
    mean_gdp      = mean(Value, na.rm = TRUE),
    sd_gdp        = sd(Value, na.rm = TRUE),
    n             = n(),
    pct_available = mean(!is.na(Value)) * 100
  )

Python Output

Frequency Table (data availability by decade): has_gdp 0 1 All Decade 1960 312 158 470 1970 580 1770 2350 1980 442 1908 2350 1990 378 1972 2350 2000 340 2010 2350 2010 295 2055 2350 2020 1180 712 1892 All 3527 10585 14112 Row Proportions: has_gdp 0 1 Decade 1960 0.6638 0.3362 1970 0.2468 0.7532 1980 0.1881 0.8119 1990 0.1609 0.8391 2000 0.1447 0.8553 2010 0.1255 0.8745 2020 0.6237 0.3763 Group Statistics (GDP by decade): mean std count Decade 1960 7.81e+09 1.82e+10 158 1970 3.45e+10 9.12e+10 1770 1980 8.67e+10 2.34e+11 1908 1990 1.54e+11 4.21e+11 1972 2000 2.89e+11 9.45e+11 2010 2010 4.12e+11 1.38e+12 2055 2020 5.67e+11 1.89e+12 712

Stata Output

. tab decade has_gdp | has_gdp decade | 0 1 | Total -----------+----------------------+---------- 1960 | 312 158 | 470 1970 | 580 1,770 | 2,350 1980 | 442 1,908 | 2,350 1990 | 378 1,972 | 2,350 2000 | 340 2,010 | 2,350 2010 | 295 2,055 | 2,350 2020 | 1,180 712 | 1,892 -----------+----------------------+---------- Total | 3,527 10,585 | 14,112 . tabstat value, by(decade) stat(mean sd count) Summary statistics: mean, sd, count by categories of: decade decade | value ----------+------------------------------ 1960 | 7.81e+09 | 1.82e+10 | 158 ----------+------------------------------ 1970 | 3.45e+10 | 9.12e+10 | 1,770 ----------+------------------------------ 1980 | 8.67e+10 | 2.34e+11 | 1,908 ----------+------------------------------ 1990 | 1.54e+11 | 4.21e+11 | 1,972 ----------+------------------------------ 2000 | 2.89e+11 | 9.45e+11 | 2,010 ----------+------------------------------ 2010 | 4.12e+11 | 1.38e+12 | 2,055 ----------+------------------------------ 2020 | 5.67e+11 | 1.89e+12 | 712 ----------+------------------------------ Total | 3.21e+11 | 1.42e+12 | 10,585 ----------------------------------------------

R Output

Frequency Table with margins: 0 1 Sum 1960 312 158 470 1970 580 1770 2350 1980 442 1908 2350 1990 378 1972 2350 2000 340 2010 2350 2010 295 2055 2350 2020 1180 712 1892 Sum 3527 10585 14112 Row Proportions: 0 1 1960 0.6638 0.3362 1970 0.2468 0.7532 1980 0.1881 0.8119 1990 0.1609 0.8391 2000 0.1447 0.8553 2010 0.1255 0.8745 2020 0.6237 0.3763 Group Statistics (tidyverse): # A tibble: 7 x 5 Decade mean_gdp sd_gdp n pct_available <dbl> <dbl> <dbl> <int> <dbl> 1 1960 7810000000. 18200000000. 470 33.6 2 1970 34500000000. 91200000000. 2350 75.3 3 1980 86700000000. 2.34e+11 2350 81.2 4 1990 1.54e+11 4.21e+11 2350 83.9 5 2000 2.89e+11 9.45e+11 2350 85.5 6 2010 4.12e+11 1.38e+12 2350 87.4 7 2020 5.67e+11 1.89e+12 1892 37.6

When to Use Cross-Tabulations

Cross-tabulations are particularly useful for spotting patterns in data availability (as we did here with missing GDP values), checking balance between treatment and control groups, computing relative frequencies (e.g., callback rates by race in audit studies), and creating summary tables for papers. For continuous outcomes, use aggregate() (R), tabstat (Stata), or groupby().agg() (Python).

Advanced GroupBy: Multi-Index Results and Group Transformations

When you group by two or more columns, the result has a multi-level index (also called a hierarchical index). Understanding how to navigate this structure is essential for panel data analysis.

Multi-Index GroupBy Results

Grouping by two columns (e.g., country and sector) produces a multi-level index where the first grouping variable is the outer level and the second is the inner level. You can use .loc[] to select a slice of the outer level.

# Grouping by two columns creates a multi-level index
result = df.groupby(['country', 'sector'])['employment'].sum()

# result is a Series with a multi-level index:
#   country   sector
#   France    Agriculture    150000
#             Industry       820000
#             Services      1500000
#   Germany   Agriculture    120000
#             ...

# Use .loc[] on the outer level to get one country's breakdown
france_sectors = result.loc['France']
# Returns a Series indexed by sector:
#   Agriculture     150000
#   Industry        820000
#   Services       1500000

# To flatten back to a regular DataFrame:
result_flat = result.reset_index()

* Stata: Grouped summaries do not create multi-level indices
* Instead, use collapse or table with multiple by-variables

* Option 1: collapse (replaces data in memory)
collapse (sum) employment, by(country sector)
list if country == "France"

* Option 2: table (displays without modifying data)
table country sector, statistic(sum employment)

# R: Grouped summaries produce a regular tibble, not a multi-index
result <- df %>%
  group_by(country, sector) %>%
  summarise(total_emp = sum(employment, na.rm = TRUE), .groups = "drop")

# Filter to one country
france_sectors <- result %>% filter(country == "France")

Group Transformations: Adding Group Statistics Without Collapsing

Sometimes you want to add a group-level statistic (like the group mean) as a new column to every row, without collapsing the data. This is different from groupby().mean() or collapse, which reduce each group to a single row.

# transform() adds a group statistic to every row WITHOUT collapsing
df['group_mean_income'] = df.groupby('region')['income'].transform('mean')

# Now each row has a new column with its region's average income
# Useful for computing deviations from group means:
df['income_deviation'] = df['income'] - df['group_mean_income']

* egen with bysort: adds group stats without collapsing
bysort region: egen group_mean_income = mean(income)

* Now compute deviations from the group mean
gen income_deviation = income - group_mean_income

# mutate() with group_by() adds group stats without collapsing
df <- df %>%
  group_by(region) %>%
  mutate(group_mean_income = mean(income, na.rm = TRUE)) %>%
  ungroup()

# Compute deviations from the group mean
df <- df %>% mutate(income_deviation = income - group_mean_income)

Key distinction: groupby().mean() / collapse / summarise() reduce each group to one row. transform() / bysort: egen / mutate() with group_by() preserve every row and add a new column. This is critical for computing within-group deviations, demeaning panel data, or adding group-level controls to individual observations.

Step 4-5: Visualization & Correlations

Visual inspection reveals patterns that summary statistics may miss. I personally like to have code produce figures files rather than just displaying them inline. This also makes them easy to include in reports for coauthors, as well as later in papers and presentations. As usual, go with your mouse over the code below to inspect the commands that save the files (the code underlined with dashed line).

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

# Create output directory for figures
os.makedirs('figures', exist_ok=True)

# Prepare data
df_viz = df.copy()
df_viz['GDP_billions'] = df_viz['Value'] / 1e9
df_viz['log_GDP'] = np.log10(df_viz['Value'].replace(0, np.nan))
df_viz['Decade'] = (df_viz['Year'] // 10) * 10

# ─────────────────────────────────────────────────────────────
# FIGURE 1: Distribution of GDP (histogram + density)
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(df_viz['log_GDP'].dropna(), kde=True, bins=40, ax=ax)
ax.set_xlabel('Log10(GDP in USD)')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of GDP (log scale)')
fig.savefig('figures/01_gdp_distribution.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/01_gdp_distribution.png")

# ─────────────────────────────────────────────────────────────
# FIGURE 2: Box plot by decade
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(data=df_viz, x='Decade', y='log_GDP', ax=ax)
ax.set_ylabel('Log10(GDP)')
ax.set_title('GDP Distribution by Decade')
ax.tick_params(axis='x', rotation=45)
fig.savefig('figures/02_gdp_boxplot_decades.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/02_gdp_boxplot_decades.png")

# ─────────────────────────────────────────────────────────────
# FIGURE 3: Time series for major economies
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 6))
countries = ['USA', 'CHN', 'DEU', 'BRA', 'IND']
for country in countries:
    data = df_viz[df_viz['Country Code'] == country]
    ax.plot(data['Year'], data['GDP_billions'], label=country, linewidth=2)
ax.set_xlabel('Year')
ax.set_ylabel('GDP (Billion USD)')
ax.set_title('GDP Trends: Major Economies')
ax.legend()
ax.grid(True, alpha=0.3)
fig.savefig('figures/03_gdp_trends.png', dpi=300, bbox_inches='tight')
fig.savefig('figures/03_gdp_trends.pdf', bbox_inches='tight')
plt.close()
print("Saved: figures/03_gdp_trends.png and .pdf")

# ─────────────────────────────────────────────────────────────
# FIGURE 4: Scatter plot - all countries over time
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 6))
sns.scatterplot(data=df_viz.dropna(), x='Year', y='log_GDP', alpha=0.3, s=15, ax=ax)
ax.set_ylabel('Log10(GDP)')
ax.set_title('All Countries: GDP Over Time')
fig.savefig('figures/04_gdp_scatter.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/04_gdp_scatter.png")

# ─────────────────────────────────────────────────────────────
# FIGURE 5: Density comparison across decades
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(8, 5))
for decade in [1980, 2000, 2020]:
    data = df_viz[(df_viz['Decade'] == decade) & (df_viz['log_GDP'].notna())]
    sns.kdeplot(data['log_GDP'], label=f'{decade}s', linewidth=2, ax=ax)
ax.set_xlabel('Log10(GDP)')
ax.set_ylabel('Density')
ax.set_title('GDP Distribution Shift Over Decades')
ax.legend(title='Decade')
fig.savefig('figures/05_gdp_density_decades.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/05_gdp_density_decades.png")

# List all generated figures
print("\n─── All figures saved to ./figures/ ───")
for f in sorted(os.listdir('figures')):
    print(f"  {f}")

* Create output directory for figures
capture mkdir "figures"

* Prepare data
gen gdp_billions = value / 1e9
gen log_gdp = log10(value)
gen decade = int(year/10)*10

* ─────────────────────────────────────────────────────────────
* FIGURE 1: Distribution of GDP (histogram + density)
* ─────────────────────────────────────────────────────────────
histogram log_gdp, kdensity frequency ///
    title("Distribution of GDP (log scale)") ///
    xtitle("Log10(GDP in USD)")
graph export "figures/01_gdp_distribution.png", replace

* ─────────────────────────────────────────────────────────────
* FIGURE 2: Box plot by decade
* ─────────────────────────────────────────────────────────────
graph box log_gdp, over(decade) ///
    title("GDP Distribution by Decade") ///
    ytitle("Log10(GDP)")
graph export "figures/02_gdp_boxplot_decades.png", replace

* ─────────────────────────────────────────────────────────────
* FIGURE 3: Time series for major economies
* ─────────────────────────────────────────────────────────────
twoway (line gdp_billions year if countrycode == "USA", lwidth(medium)) ///
       (line gdp_billions year if countrycode == "CHN", lwidth(medium)) ///
       (line gdp_billions year if countrycode == "DEU", lwidth(medium)) ///
       (line gdp_billions year if countrycode == "BRA", lwidth(medium)) ///
       (line gdp_billions year if countrycode == "IND", lwidth(medium)), ///
       legend(label(1 "USA") label(2 "China") label(3 "Germany") ///
              label(4 "Brazil") label(5 "India")) ///
       title("GDP Trends: Major Economies") ///
       ytitle("GDP (Billion USD)") xtitle("Year")
graph export "figures/03_gdp_trends.png", replace
graph export "figures/03_gdp_trends.pdf", replace

* ─────────────────────────────────────────────────────────────
* FIGURE 4: Scatter plot - all countries over time
* ─────────────────────────────────────────────────────────────
scatter log_gdp year, msize(tiny) mcolor(%30) ///
    title("All Countries: GDP Over Time") ///
    ytitle("Log10(GDP)") xtitle("Year")
graph export "figures/04_gdp_scatter.png", replace

* ─────────────────────────────────────────────────────────────
* FIGURE 5: Density comparison across decades
* ─────────────────────────────────────────────────────────────
twoway (kdensity log_gdp if decade == 1980, lwidth(medium)) ///
       (kdensity log_gdp if decade == 2000, lwidth(medium)) ///
       (kdensity log_gdp if decade == 2020, lwidth(medium)), ///
       legend(label(1 "1980s") label(2 "2000s") label(3 "2020s")) ///
       title("GDP Distribution Shift Over Decades") ///
       xtitle("Log10(GDP)") ytitle("Density")
graph export "figures/05_gdp_density_decades.png", replace

* List all generated figures
dir "figures/"

# Load required libraries
library(tidyverse)

# Create output directory for figures
dir.create("figures", showWarnings = FALSE)

# Prepare data
df_viz <- df %>%
  mutate(
    GDP_billions = Value / 1e9,
    log_GDP = log10(Value),
    Decade = (Year %/% 10) * 10
  )

# ─────────────────────────────────────────────────────────────
# FIGURE 1: Distribution of GDP (histogram + density)
# ─────────────────────────────────────────────────────────────
p1 <- ggplot(df_viz, aes(x = log_GDP)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "steelblue", alpha = 0.7) +
  geom_density(color = "darkred", linewidth = 1) +
  labs(x = "Log10(GDP in USD)", y = "Density",
       title = "Distribution of GDP (log scale)")
ggsave("figures/01_gdp_distribution.png", p1, width = 8, height = 5, dpi = 300)
cat("Saved: figures/01_gdp_distribution.png\n")

# ─────────────────────────────────────────────────────────────
# FIGURE 2: Box plot by decade
# ─────────────────────────────────────────────────────────────
p2 <- df_viz %>%
  filter(!is.na(log_GDP)) %>%
  ggplot(aes(x = factor(Decade), y = log_GDP)) +
  geom_boxplot(fill = "lightblue", outlier.alpha = 0.3) +
  labs(x = "Decade", y = "Log10(GDP)",
       title = "GDP Distribution by Decade")
ggsave("figures/02_gdp_boxplot_decades.png", p2, width = 10, height = 5, dpi = 300)
cat("Saved: figures/02_gdp_boxplot_decades.png\n")

# ─────────────────────────────────────────────────────────────
# FIGURE 3: Time series for major economies
# ─────────────────────────────────────────────────────────────
countries <- c("USA", "CHN", "DEU", "BRA", "IND")
p3 <- df_viz %>%
  filter(`Country Code` %in% countries) %>%
  ggplot(aes(x = Year, y = GDP_billions, color = `Country Code`)) +
  geom_line(linewidth = 1.2) +
  labs(y = "GDP (Billion USD)", title = "GDP Trends: Major Economies") +
  theme(legend.position = "bottom")
ggsave("figures/03_gdp_trends.png", p3, width = 10, height = 6, dpi = 300)
ggsave("figures/03_gdp_trends.pdf", p3, width = 10, height = 6)
cat("Saved: figures/03_gdp_trends.png and .pdf\n")

# ─────────────────────────────────────────────────────────────
# FIGURE 4: Scatter plot - all countries over time
# ─────────────────────────────────────────────────────────────
p4 <- df_viz %>%
  filter(!is.na(log_GDP)) %>%
  ggplot(aes(x = Year, y = log_GDP)) +
  geom_point(alpha = 0.2, size = 0.8) +
  labs(y = "Log10(GDP)", title = "All Countries: GDP Over Time")
ggsave("figures/04_gdp_scatter.png", p4, width = 10, height = 6, dpi = 300)
cat("Saved: figures/04_gdp_scatter.png\n")

# ─────────────────────────────────────────────────────────────
# FIGURE 5: Density comparison across decades
# ─────────────────────────────────────────────────────────────
p5 <- df_viz %>%
  filter(Decade %in% c(1980, 2000, 2020), !is.na(log_GDP)) %>%
  ggplot(aes(x = log_GDP, color = factor(Decade))) +
  geom_density(linewidth = 1.2) +
  labs(x = "Log10(GDP)", y = "Density", color = "Decade",
       title = "GDP Distribution Shift Over Decades")
ggsave("figures/05_gdp_density_decades.png", p5, width = 8, height = 5, dpi = 300)
cat("Saved: figures/05_gdp_density_decades.png\n")

# List all generated figures
cat("\n─── All figures saved to ./figures/ ───\n")
list.files("figures")

Python Output

Saved: figures/01_gdp_distribution.png Saved: figures/02_gdp_boxplot_decades.png Saved: figures/03_gdp_trends.png and .pdf Saved: figures/04_gdp_scatter.png Saved: figures/05_gdp_density_decades.png ─── All figures saved to ./figures/ ─── 01_gdp_distribution.png 02_gdp_boxplot_decades.png 03_gdp_trends.pdf 03_gdp_trends.png 04_gdp_scatter.png 05_gdp_density_decades.png

Saved figure files:

figures/01_gdp_distribution.png

figures/02_gdp_boxplot_decades.png

figures/03_gdp_trends.png

figures/04_gdp_scatter.png

figures/05_gdp_density_decades.png

Stata Output

. capture mkdir "figures" . histogram log_gdp, kdensity frequency ... . graph export "figures/01_gdp_distribution.png", replace file figures/01_gdp_distribution.png saved as PNG format . graph box log_gdp, over(decade) ... . graph export "figures/02_gdp_boxplot_decades.png", replace file figures/02_gdp_boxplot_decades.png saved as PNG format . twoway (line gdp_billions year if countrycode == "USA") ... . graph export "figures/03_gdp_trends.png", replace file figures/03_gdp_trends.png saved as PNG format . graph export "figures/03_gdp_trends.pdf", replace file figures/03_gdp_trends.pdf saved as PDF format . scatter log_gdp year, msize(tiny) mcolor(%30) ... . graph export "figures/04_gdp_scatter.png", replace file figures/04_gdp_scatter.png saved as PNG format . twoway (kdensity log_gdp if decade == 1980) ... . graph export "figures/05_gdp_density_decades.png", replace file figures/05_gdp_density_decades.png saved as PNG format . dir "figures/" ───────────────────────────────────────────────────────────────── Directory: ./figures/ ───────────────────────────────────────────────────────────────── <file> 01_gdp_distribution.png 142 KB 17 Jan 2026 14:23 <file> 02_gdp_boxplot_decades.png 98 KB 17 Jan 2026 14:23 <file> 03_gdp_trends.pdf 45 KB 17 Jan 2026 14:23 <file> 03_gdp_trends.png 156 KB 17 Jan 2026 14:23 <file> 04_gdp_scatter.png 287 KB 17 Jan 2026 14:23 <file> 05_gdp_density_decades.png 76 KB 17 Jan 2026 14:23 ─────────────────────────────────────────────────────────────────

Saved figure files:

figures/01_gdp_distribution.png

figures/02_gdp_boxplot_decades.png

figures/03_gdp_trends.png

figures/04_gdp_scatter.png

figures/05_gdp_density_decades.png

R Output (RStudio Console)

> dir.create("figures", showWarnings = FALSE) > ggsave("figures/01_gdp_distribution.png", p1, ...) Saved: figures/01_gdp_distribution.png > ggsave("figures/02_gdp_boxplot_decades.png", p2, ...) Saved: figures/02_gdp_boxplot_decades.png > ggsave("figures/03_gdp_trends.png", p3, ...) > ggsave("figures/03_gdp_trends.pdf", p3, ...) Saved: figures/03_gdp_trends.png and .pdf > ggsave("figures/04_gdp_scatter.png", p4, ...) Saved: figures/04_gdp_scatter.png > ggsave("figures/05_gdp_density_decades.png", p5, ...) Saved: figures/05_gdp_density_decades.png ─── All figures saved to ./figures/ ─── > list.files("figures") [1] "01_gdp_distribution.png" [2] "02_gdp_boxplot_decades.png" [3] "03_gdp_trends.pdf" [4] "03_gdp_trends.png" [5] "04_gdp_scatter.png" [6] "05_gdp_density_decades.png"

Saved figure files:

figures/01_gdp_distribution.png

figures/02_gdp_boxplot_decades.png

figures/03_gdp_trends.png

figures/04_gdp_scatter.png

figures/05_gdp_density_decades.png

Step 6: Filtering Rows and Fixing Data Types

Before computing statistics, you often need to filter rows based on conditions and fix data types (e.g., columns that should be numeric but were read as strings). These operations are fundamental to every data exploration workflow.

Boolean Indexing (Filtering Rows by Condition)

In Python, you filter rows by creating a Boolean mask — a Series of True/False values — and passing it inside df[...]. Only rows where the condition is True are kept.

# Python: Boolean indexing
import pandas as pd

# Single condition: keep rows where year is 2020
df_2020 = df[df['year'] == 2020]

# How it works:
#   df['year'] == 2020  creates a Boolean Series (True/False per row)
#   df[...]             keeps only the True rows

# Combining conditions: use & (and) or | (or) with PARENTHESES
high_income = df[(df['region'] == 'North') & (df['income'] > 50000)]

# COMMON BUG: Python's 'and' keyword does NOT work with pandas Series
# df[df['region'] == 'North' and df['income'] > 50000]  # ERROR!
# Use & instead, with parentheses around each condition

# The .query() method is a readable alternative for complex filters
result = df.query('year >= 2010 & income > 0')

* Stata: Filtering with if conditions

* Filter with if (keeps data temporarily for that command)
summarize income if year == 2020

* Combine conditions with & (and) and | (or)
summarize income if region == "North" & income > 50000

* Permanently drop rows
keep if year >= 2010 & income > 0

* Or drop specific rows
drop if income <= 0 | missing(income)

# R: Filtering with dplyr::filter()
library(dplyr)

# Single condition
df_2020 <- df %>% filter(year == 2020)

# Combine conditions: comma means AND, | means OR
high_income <- df %>%
  filter(region == "North", income > 50000)

# COMMON BUG: | needs a full comparison on each side
# filter(region == 'North' | 'South')   # WRONG!
# filter(region == 'North' | region == 'South')  # Correct
# Or use %in%:
subset <- df %>%
  filter(region %in% c("North", "South"))

Fixing Data Types with pd.to_numeric()

When numeric columns are accidentally read as strings (due to commas in 45,000, currency symbols, or stray text), statistical functions like .mean() silently skip them. Always check df.dtypes and convert with pd.to_numeric().

# Python: Fixing data types

# Check column types — look for 'object' where you expect numbers
print(df.dtypes)

# Convert a column to numeric; errors='coerce' turns bad values to NaN
df['income'] = pd.to_numeric(df['income'], errors='coerce')

# Remove commas before converting (common in European/financial data)
df['income'] = df['income'].str.replace(',', '').astype(float)

# Drop rows where key columns are missing after conversion
df = df.dropna(subset=['income'])

* Stata: Fixing data types

* Check variable types
describe

* Convert string to numeric (destring removes non-numeric characters)
destring income, replace force

* Or remove specific characters first
replace income_str = subinstr(income_str, ",", "", .)
destring income_str, gen(income) force

* Drop observations with missing values
drop if missing(income)

# R: Fixing data types

# Check column types
str(df)

# Convert character to numeric (non-numeric values become NA)
df$income <- as.numeric(df$income)

# Remove commas first if needed
df$income <- as.numeric(gsub(",", "", df$income))

# Drop rows with missing values in specific columns
df <- df %>% filter(!is.na(income))

Complete "First Analysis" Script

Use this script as a guideline to produce one you can run on every new dataset before any analysis:

Click to expand complete script

# ============================================
# FIRST ANALYSIS SCRIPT - You can run something like this on every new dataset!
# ============================================
import pandas as pd
import matplotlib.pyplot as plt

# 1. LOAD DATA - Replace URL with your data source
url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"
df = pd.read_csv(url)

print("="*50)
print("FIRST ANALYSIS REPORT")
print("="*50)

# 2. STRUCTURE - Understand dimensions and types
print(f"\nRows: {len(df):,} | Columns: {len(df.columns)}")
print(f"\nColumn types:\n{df.dtypes}")

# 3. MISSING VALUES - Critical for research quality
print("\n--- MISSING VALUES ---")
missing_pct = (df.isnull().sum() / len(df) * 100).round(1)
print(missing_pct[missing_pct > 0])

# 4. SUMMARY STATISTICS - Understand distributions
print("\n--- SUMMARY STATISTICS ---")
print(df.describe())

# 5. HISTOGRAMS - Visual inspection of all numeric columns
df.select_dtypes(include=['number']).hist(figsize=(12, 6), bins=30)
plt.tight_layout()
plt.savefig('first_analysis_histograms.png')
plt.show()

print("\n" + "="*50)
print("ANALYSIS COMPLETE")
print("="*50)

* ============================================
* FIRST ANALYSIS SCRIPT - You can run something like this on every new dataset!
* ============================================

* 1. LOAD DATA - Replace URL with your data source
clear all
import delimited "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv", clear

display "=================================================="
display "FIRST ANALYSIS REPORT"
display "=================================================="

* 2. STRUCTURE - Understand dimensions and types
describe, short
describe

* 3. MISSING VALUES - Critical for research quality
display "--- MISSING VALUES ---"
misstable summarize

* 4. SUMMARY STATISTICS - Understand distributions
display "--- SUMMARY STATISTICS ---"
summarize

* 5. HISTOGRAMS - Visual inspection of numeric columns
* Stata creates one histogram at a time; loop over numeric variables
ds, has(type numeric)
foreach var in `r(varlist)' {
    histogram `var', freq title("Distribution of `var'")
    graph export "first_analysis_`var'.png", replace
}

display "=================================================="
display "ANALYSIS COMPLETE"
display "=================================================="

# ============================================
# FIRST ANALYSIS SCRIPT - You can run something like this on every new dataset!
# ============================================
library(tidyverse)

# 1. LOAD DATA - Replace URL with your data source
url <- "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"
df <- read_csv(url)

cat(strrep("=", 50), "\n")
cat("FIRST ANALYSIS REPORT\n")
cat(strrep("=", 50), "\n")

# 2. STRUCTURE - Understand dimensions and types
cat(sprintf("\nRows: %s | Columns: %d\n", format(nrow(df), big.mark=","), ncol(df)))
cat("\nColumn types:\n")
str(df)

# 3. MISSING VALUES - Critical for research quality
cat("\n--- MISSING VALUES ---\n")
missing_pct <- round(colSums(is.na(df)) / nrow(df) * 100, 1)
print(missing_pct[missing_pct > 0])

# 4. SUMMARY STATISTICS - Understand distributions
cat("\n--- SUMMARY STATISTICS ---\n")
summary(df)

# 5. HISTOGRAMS - Visual inspection of all numeric columns
numeric_df <- df %>% select_if(is.numeric)
numeric_long <- numeric_df %>% pivot_longer(cols = everything(), names_to = "variable", values_to = "value")

p <- ggplot(numeric_long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Numeric Variables")

ggsave("first_analysis_histograms.png", p, width = 12, height = 6)
print(p)

cat("\n", strrep("=", 50), "\n")
cat("ANALYSIS COMPLETE\n")
cat(strrep("=", 50), "\n")

ProTools ER1

Course Modules

3 Data Exploration

Learning Objectives

Course Research Project: Climate Policy & Economic Development

The Data Exploration Workflow

Step 1: Loading Data

What is "Loading Data"?

Step 2: Inspecting Structure

Step 3: Summary Statistics

Cross-Tabulations and Group-Level Statistics

Advanced GroupBy: Multi-Index Results and Group Transformations

Multi-Index GroupBy Results

Group Transformations: Adding Group Statistics Without Collapsing

Step 4-5: Visualization & Correlations

Step 6: Filtering Rows and Fixing Data Types

Boolean Indexing (Filtering Rows by Condition)

Fixing Data Types with pd.to_numeric()

Complete "First Analysis" Script

ProTools ER1 Assistant