3  Data Exploration

~4 hours Loading, Understanding, Exploring Beginner-Intermediate

Learning Objectives

  • Load data from CSV, Excel, and Stata formats
  • Understand dataset structure before any analysis
  • Build a comprehensive "First Analysis" script
  • Explore distributions, ranges, and correlations

Course Research Project: Climate Policy & Economic Development

Throughout this course, we'll build a complete research project worthy of publication. Our question:

"How does climate vulnerability affect economic growth across countries, and do international climate agreements moderate this relationship?"

This project uses real World Bank data, involves API calls, text analysis of policy documents, and causal inference methods. Each module advances the project.

A Critical Mistake I Often see

I witnessed many students—(and occasionally also senior researchers...)—make a costly mistake: jumping straight into analysis without first getting to know their data.

The consequences: running regressions on variables with 80% missing values, drawing conclusions from data entry errors, missing obvious patterns, etc. In this module I'll share a more disciplined approach.

The Data Exploration Workflow

1
Load

Import data

2
Inspect

Structure & types

3
Summarize

Distributions

4
Visualize

Patterns

5
Correlate

Relationships

Step 1: Loading Data

What is "Loading Data"?

Loading data means reading information from a file into your program's memory. The file stays unchanged—you create a working copy. Different formats (CSV, Excel, Stata) require different commands.

For our research project, we'll use World Bank development indicators. Click Run to see the output for each language:

# COMPLETE EXAMPLE: Load World Bank data directly from URL
# This demonstrates loading CSV data from the internet

# Import pandas - the main data manipulation library in Python
import pandas as pd

# World Bank provides CSV downloads - we use GDP data
# This URL points to a publicly available dataset on GitHub
url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"

# read_csv() loads CSV data into a DataFrame
# It automatically detects column names and data types
df = pd.read_csv(url)

# Verify the load worked by checking dimensions
# len(df) returns number of rows, len(df.columns) returns number of columns
print(f"Loaded {len(df):,} rows and {len(df.columns)} columns")
print(f"Columns: {list(df.columns)}")
* COMPLETE EXAMPLE: Load World Bank data
* Stata can import CSV files directly from URLs

* Clear any existing data from memory
clear all

* Import CSV file from URL into Stata
* 'delimited' means comma-separated values
* 'clear' option replaces any data in memory
import delimited "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv", clear

* Verify the load - describe shows variable info
* 'short' option gives a compact summary
describe, short

* List first 5 observations to preview the data
list in 1/5
# COMPLETE EXAMPLE: Load World Bank data directly from URL
# R can read CSV files from URLs natively

# Load tidyverse - a collection of R packages for data science
# Includes readr (for reading data), dplyr (for manipulation), ggplot2 (for visualization)
library(tidyverse)

# Define the URL pointing to our CSV data
url <- "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"

# read_csv() from readr package loads the data into a tibble
# A tibble is a modern version of R's data.frame
df <- read_csv(url)

# Print dimensions and column names
# nrow() = number of rows, ncol() = number of columns
cat(sprintf("Loaded %s rows and %d columns\n",
    format(nrow(df), big.mark=","), ncol(df)))
print(names(df))
Python Output (VS Code)
Loaded 14,112 rows and 4 columns Columns: ['Country Name', 'Country Code', 'Year', 'Value']
Stata Output
. import delimited "https://...gdp.csv", clear (encoding automatically selected: UTF-8) (4 vars, 14,112 obs) . describe, short Contains data Observations: 14,112 Variables: 4 . list in 1/5 +--------------------------------------------------+ | countryname countrycode year value | |--------------------------------------------------| 1. | Arab World ARB 1968 2.5761e+10 | 2. | Arab World ARB 1969 2.8434e+10 | 3. | Arab World ARB 1970 3.1386e+10 | 4. | Arab World ARB 1971 3.5875e+10 | 5. | Arab World ARB 1972 4.2509e+10 | +--------------------------------------------------+
R Output (RStudio Console)
Rows: 14112 Columns: 4 ── Column specification ────────────────────────────── Delimiter: "," chr (2): Country Name, Country Code dbl (2): Year, Value Loaded 14,112 rows and 4 columns [1] "Country Name" "Country Code" "Year" "Value"

Step 2: Inspecting Structure

Before analyzing data, we need to understand its structure: what variables exist, their types, and any missing values. Knowing the data structure will prevent many mistakes!

# Continuing from above - inspect the data structure
# This is crucial before any analysis!

# info() shows data types and non-null counts for each column
# This reveals missing values and helps identify type issues
print("=== DATA STRUCTURE ===")
print(df.info())

# head() shows first rows - always visually inspect!
# Default is 5 rows, but you can specify: head(10)
print("\n=== FIRST 5 ROWS ===")
print(df.head())
* Continuing - inspect structure
* Always check your data before analysis!

* Full variable description including type, format, labels
describe

* First observations - visual inspection is crucial
list in 1/5
# Continuing - inspect structure
# Understanding structure prevents analysis mistakes

# str() shows structure: variable types, sample values
# This is R's equivalent of pandas info()
str(df)

# head() shows first 6 rows by default
head(df)
Python Output
=== DATA STRUCTURE === <class 'pandas.core.frame.DataFrame'> RangeIndex: 14112 entries, 0 to 14111 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country Name 14112 non-null object 1 Country Code 14112 non-null object 2 Year 14112 non-null int64 3 Value 10585 non-null float64 === FIRST 5 ROWS === Country Name Country Code Year Value 0 Arab World ARB 1968 2.576068e+10 1 Arab World ARB 1969 2.843420e+10 2 Arab World ARB 1970 3.138565e+10 3 Arab World ARB 1971 3.587548e+10 4 Arab World ARB 1972 4.250944e+10
Stata Output
. describe Contains data Observations: 14,112 Variables: 4 Variable Storage Display Value name type format label ─────────────────────────────────────────── countryname str52 %52s countrycode str3 %9s year int %10.0g value double %10.0g . list in 1/5 +--------------------------------------------------+ | countryname countrycode year value | |--------------------------------------------------| 1. | Arab World ARB 1968 2.5761e+10 | 2. | Arab World ARB 1969 2.8434e+10 | 3. | Arab World ARB 1970 3.1386e+10 | 4. | Arab World ARB 1971 3.5875e+10 | 5. | Arab World ARB 1972 4.2509e+10 | +--------------------------------------------------+
R Output (RStudio Console)
> str(df) tibble [14,112 x 4] (S3: tbl_df/tbl/data.frame) $ Country Name: chr "Arab World" "Arab World" "Arab World" ... $ Country Code: chr "ARB" "ARB" "ARB" "ARB" ... $ Year : num 1968 1969 1970 1971 1972 ... $ Value : num 2.58e+10 2.84e+10 3.14e+10 3.59e+10 4.25e+10 ... > head(df) # A tibble: 6 x 4 `Country Name` `Country Code` Year Value <chr> <chr> <dbl> <dbl> 1 Arab World ARB 1968 25760680000 2 Arab World ARB 1969 28434200000 3 Arab World ARB 1970 31385650000 4 Arab World ARB 1971 35875480000 5 Arab World ARB 1972 42509440000 6 Arab World ARB 1973 54832410000

Step 3: Summary Statistics

Now we calculate key statistics to understand our data's distribution and identify potential issues.

# Summary statistics - understand your distributions

# describe() gives count, mean, std, min, 25%, 50%, 75%, max
# Only works on numeric columns by default
print("=== SUMMARY STATISTICS ===")
print(df.describe())

# Missing values analysis - CRITICAL for research!
# isnull() returns True/False, sum() counts True values
print("\n=== MISSING VALUES ===")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
print(pd.DataFrame({'Missing': missing, 'Percent': missing_pct}))
* Summary statistics
* 'detail' option gives more info: percentiles, skewness, kurtosis
summarize, detail

* Missing values analysis
* misstable shows patterns of missing data
misstable summarize
# Summary statistics
# summary() gives min, 1st quartile, median, mean, 3rd quartile, max
summary(df)

# Missing values count per column
# is.na() returns TRUE/FALSE, colSums() sums each column
cat("\nMissing values:\n")
colSums(is.na(df))
Python Output
=== SUMMARY STATISTICS === Year Value count 14112.00 10585.00 mean 1993.50 3.21e+11 std 15.59 1.42e+12 min 1960.00 2.69e+06 25% 1976.75 2.23e+09 50% 1993.50 1.45e+10 75% 2010.25 9.89e+10 max 2027.00 2.55e+13 === MISSING VALUES === Missing Percent Country Name 0 0.0 Country Code 0 0.0 Year 0 0.0 Value 3527 25.0
Stata Output
. summarize, detail year ───────────────────────────────────────────────────── Percentiles Smallest 1% 1968 1960 5% 1970 1960 10% 1972 1960 Obs 14,112 25% 1977 1960 Sum of wgt. 14,112 50% 1994 Mean 1993.5 Largest Std. dev. 15.589 75% 2010 2027 90% 2016 2027 Variance 243.02 95% 2019 2027 Skewness 0 99% 2024 2027 Kurtosis 1.8 . misstable summarize Obs<. ───────── Obs. Variable Unique missing name values Min Max ───────────────────────────────────────────────── 3,527 value 10585 . .
R Output (RStudio Console)
> summary(df) Country Name Country Code Year Value Length:14112 Length:14112 Min. :1960 Min. :2.69e+06 Class :character Class :character 1st Qu.:1977 1st Qu.:2.23e+09 Mode :character Mode :character Median :1994 Median :1.45e+10 Mean :1994 Mean :3.21e+11 3rd Qu.:2010 3rd Qu.:9.89e+10 Max. :2027 Max. :2.55e+13 NA's :3527 Missing values: Country Name Country Code Year Value 0 0 0 3527
Red Flag Detected!

25% of GDP values are missing. Before any analysis, we need to understand why. Are they missing for small countries? Recent years? This is exactly what the "First Analysis" approach catches early.

Step 4-5: Visualization & Correlations

Visual inspection reveals patterns that summary statistics may miss. I personally like to have code produce figures files rather than just displaying them inline. This also makes them easy to include in reports for coauthors, as well as later in papers and presentations. As usual, go with your mouse over the code below to inspect the commands that save the files (the code underlined with dashed line).

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

# Create output directory for figures
os.makedirs('figures', exist_ok=True)

# Prepare data
df_viz = df.copy()
df_viz['GDP_billions'] = df_viz['Value'] / 1e9
df_viz['log_GDP'] = np.log10(df_viz['Value'].replace(0, np.nan))
df_viz['Decade'] = (df_viz['Year'] // 10) * 10

# ─────────────────────────────────────────────────────────────
# FIGURE 1: Distribution of GDP (histogram + density)
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(df_viz['log_GDP'].dropna(), kde=True, bins=40, ax=ax)
ax.set_xlabel('Log10(GDP in USD)')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of GDP (log scale)')
fig.savefig('figures/01_gdp_distribution.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/01_gdp_distribution.png")

# ─────────────────────────────────────────────────────────────
# FIGURE 2: Box plot by decade
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(data=df_viz, x='Decade', y='log_GDP', ax=ax)
ax.set_ylabel('Log10(GDP)')
ax.set_title('GDP Distribution by Decade')
ax.tick_params(axis='x', rotation=45)
fig.savefig('figures/02_gdp_boxplot_decades.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/02_gdp_boxplot_decades.png")

# ─────────────────────────────────────────────────────────────
# FIGURE 3: Time series for major economies
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 6))
countries = ['USA', 'CHN', 'DEU', 'BRA', 'IND']
for country in countries:
    data = df_viz[df_viz['Country Code'] == country]
    ax.plot(data['Year'], data['GDP_billions'], label=country, linewidth=2)
ax.set_xlabel('Year')
ax.set_ylabel('GDP (Billion USD)')
ax.set_title('GDP Trends: Major Economies')
ax.legend()
ax.grid(True, alpha=0.3)
fig.savefig('figures/03_gdp_trends.png', dpi=300, bbox_inches='tight')
fig.savefig('figures/03_gdp_trends.pdf', bbox_inches='tight')
plt.close()
print("Saved: figures/03_gdp_trends.png and .pdf")

# ─────────────────────────────────────────────────────────────
# FIGURE 4: Scatter plot - all countries over time
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 6))
sns.scatterplot(data=df_viz.dropna(), x='Year', y='log_GDP', alpha=0.3, s=15, ax=ax)
ax.set_ylabel('Log10(GDP)')
ax.set_title('All Countries: GDP Over Time')
fig.savefig('figures/04_gdp_scatter.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/04_gdp_scatter.png")

# ─────────────────────────────────────────────────────────────
# FIGURE 5: Density comparison across decades
# ─────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(8, 5))
for decade in [1980, 2000, 2020]:
    data = df_viz[(df_viz['Decade'] == decade) & (df_viz['log_GDP'].notna())]
    sns.kdeplot(data['log_GDP'], label=f'{decade}s', linewidth=2, ax=ax)
ax.set_xlabel('Log10(GDP)')
ax.set_ylabel('Density')
ax.set_title('GDP Distribution Shift Over Decades')
ax.legend(title='Decade')
fig.savefig('figures/05_gdp_density_decades.png', dpi=300, bbox_inches='tight')
plt.close()
print("Saved: figures/05_gdp_density_decades.png")

# List all generated figures
print("\n─── All figures saved to ./figures/ ───")
for f in sorted(os.listdir('figures')):
    print(f"  {f}")
* Create output directory for figures
capture mkdir "figures"

* Prepare data
gen gdp_billions = value / 1e9
gen log_gdp = log10(value)
gen decade = int(year/10)*10

* ─────────────────────────────────────────────────────────────
* FIGURE 1: Distribution of GDP (histogram + density)
* ─────────────────────────────────────────────────────────────
histogram log_gdp, kdensity frequency ///
    title("Distribution of GDP (log scale)") ///
    xtitle("Log10(GDP in USD)")
graph export "figures/01_gdp_distribution.png", replace

* ─────────────────────────────────────────────────────────────
* FIGURE 2: Box plot by decade
* ─────────────────────────────────────────────────────────────
graph box log_gdp, over(decade) ///
    title("GDP Distribution by Decade") ///
    ytitle("Log10(GDP)")
graph export "figures/02_gdp_boxplot_decades.png", replace

* ─────────────────────────────────────────────────────────────
* FIGURE 3: Time series for major economies
* ─────────────────────────────────────────────────────────────
twoway (line gdp_billions year if countrycode == "USA", lwidth(medium)) ///
       (line gdp_billions year if countrycode == "CHN", lwidth(medium)) ///
       (line gdp_billions year if countrycode == "DEU", lwidth(medium)) ///
       (line gdp_billions year if countrycode == "BRA", lwidth(medium)) ///
       (line gdp_billions year if countrycode == "IND", lwidth(medium)), ///
       legend(label(1 "USA") label(2 "China") label(3 "Germany") ///
              label(4 "Brazil") label(5 "India")) ///
       title("GDP Trends: Major Economies") ///
       ytitle("GDP (Billion USD)") xtitle("Year")
graph export "figures/03_gdp_trends.png", replace
graph export "figures/03_gdp_trends.pdf", replace

* ─────────────────────────────────────────────────────────────
* FIGURE 4: Scatter plot - all countries over time
* ─────────────────────────────────────────────────────────────
scatter log_gdp year, msize(tiny) mcolor(%30) ///
    title("All Countries: GDP Over Time") ///
    ytitle("Log10(GDP)") xtitle("Year")
graph export "figures/04_gdp_scatter.png", replace

* ─────────────────────────────────────────────────────────────
* FIGURE 5: Density comparison across decades
* ─────────────────────────────────────────────────────────────
twoway (kdensity log_gdp if decade == 1980, lwidth(medium)) ///
       (kdensity log_gdp if decade == 2000, lwidth(medium)) ///
       (kdensity log_gdp if decade == 2020, lwidth(medium)), ///
       legend(label(1 "1980s") label(2 "2000s") label(3 "2020s")) ///
       title("GDP Distribution Shift Over Decades") ///
       xtitle("Log10(GDP)") ytitle("Density")
graph export "figures/05_gdp_density_decades.png", replace

* List all generated figures
dir "figures/"
# Load required libraries
library(tidyverse)

# Create output directory for figures
dir.create("figures", showWarnings = FALSE)

# Prepare data
df_viz <- df %>%
  mutate(
    GDP_billions = Value / 1e9,
    log_GDP = log10(Value),
    Decade = (Year %/% 10) * 10
  )

# ─────────────────────────────────────────────────────────────
# FIGURE 1: Distribution of GDP (histogram + density)
# ─────────────────────────────────────────────────────────────
p1 <- ggplot(df_viz, aes(x = log_GDP)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "steelblue", alpha = 0.7) +
  geom_density(color = "darkred", linewidth = 1) +
  labs(x = "Log10(GDP in USD)", y = "Density",
       title = "Distribution of GDP (log scale)")
ggsave("figures/01_gdp_distribution.png", p1, width = 8, height = 5, dpi = 300)
cat("Saved: figures/01_gdp_distribution.png\n")

# ─────────────────────────────────────────────────────────────
# FIGURE 2: Box plot by decade
# ─────────────────────────────────────────────────────────────
p2 <- df_viz %>%
  filter(!is.na(log_GDP)) %>%
  ggplot(aes(x = factor(Decade), y = log_GDP)) +
  geom_boxplot(fill = "lightblue", outlier.alpha = 0.3) +
  labs(x = "Decade", y = "Log10(GDP)",
       title = "GDP Distribution by Decade")
ggsave("figures/02_gdp_boxplot_decades.png", p2, width = 10, height = 5, dpi = 300)
cat("Saved: figures/02_gdp_boxplot_decades.png\n")

# ─────────────────────────────────────────────────────────────
# FIGURE 3: Time series for major economies
# ─────────────────────────────────────────────────────────────
countries <- c("USA", "CHN", "DEU", "BRA", "IND")
p3 <- df_viz %>%
  filter(`Country Code` %in% countries) %>%
  ggplot(aes(x = Year, y = GDP_billions, color = `Country Code`)) +
  geom_line(linewidth = 1.2) +
  labs(y = "GDP (Billion USD)", title = "GDP Trends: Major Economies") +
  theme(legend.position = "bottom")
ggsave("figures/03_gdp_trends.png", p3, width = 10, height = 6, dpi = 300)
ggsave("figures/03_gdp_trends.pdf", p3, width = 10, height = 6)
cat("Saved: figures/03_gdp_trends.png and .pdf\n")

# ─────────────────────────────────────────────────────────────
# FIGURE 4: Scatter plot - all countries over time
# ─────────────────────────────────────────────────────────────
p4 <- df_viz %>%
  filter(!is.na(log_GDP)) %>%
  ggplot(aes(x = Year, y = log_GDP)) +
  geom_point(alpha = 0.2, size = 0.8) +
  labs(y = "Log10(GDP)", title = "All Countries: GDP Over Time")
ggsave("figures/04_gdp_scatter.png", p4, width = 10, height = 6, dpi = 300)
cat("Saved: figures/04_gdp_scatter.png\n")

# ─────────────────────────────────────────────────────────────
# FIGURE 5: Density comparison across decades
# ─────────────────────────────────────────────────────────────
p5 <- df_viz %>%
  filter(Decade %in% c(1980, 2000, 2020), !is.na(log_GDP)) %>%
  ggplot(aes(x = log_GDP, color = factor(Decade))) +
  geom_density(linewidth = 1.2) +
  labs(x = "Log10(GDP)", y = "Density", color = "Decade",
       title = "GDP Distribution Shift Over Decades")
ggsave("figures/05_gdp_density_decades.png", p5, width = 8, height = 5, dpi = 300)
cat("Saved: figures/05_gdp_density_decades.png\n")

# List all generated figures
cat("\n─── All figures saved to ./figures/ ───\n")
list.files("figures")
Python Output
Saved: figures/01_gdp_distribution.png Saved: figures/02_gdp_boxplot_decades.png Saved: figures/03_gdp_trends.png and .pdf Saved: figures/04_gdp_scatter.png Saved: figures/05_gdp_density_decades.png ─── All figures saved to ./figures/ ─── 01_gdp_distribution.png 02_gdp_boxplot_decades.png 03_gdp_trends.pdf 03_gdp_trends.png 04_gdp_scatter.png 05_gdp_density_decades.png

Saved figure files:

figures/01_gdp_distribution.png
GDP Distribution Histogram
figures/02_gdp_boxplot_decades.png
GDP Boxplot by Decade
figures/03_gdp_trends.png
GDP Trends
figures/04_gdp_scatter.png
GDP Scatter
figures/05_gdp_density_decades.png
GDP Density by Decade
Stata Output
. capture mkdir "figures" . histogram log_gdp, kdensity frequency ... . graph export "figures/01_gdp_distribution.png", replace file figures/01_gdp_distribution.png saved as PNG format . graph box log_gdp, over(decade) ... . graph export "figures/02_gdp_boxplot_decades.png", replace file figures/02_gdp_boxplot_decades.png saved as PNG format . twoway (line gdp_billions year if countrycode == "USA") ... . graph export "figures/03_gdp_trends.png", replace file figures/03_gdp_trends.png saved as PNG format . graph export "figures/03_gdp_trends.pdf", replace file figures/03_gdp_trends.pdf saved as PDF format . scatter log_gdp year, msize(tiny) mcolor(%30) ... . graph export "figures/04_gdp_scatter.png", replace file figures/04_gdp_scatter.png saved as PNG format . twoway (kdensity log_gdp if decade == 1980) ... . graph export "figures/05_gdp_density_decades.png", replace file figures/05_gdp_density_decades.png saved as PNG format . dir "figures/" ───────────────────────────────────────────────────────────────── Directory: ./figures/ ───────────────────────────────────────────────────────────────── <file> 01_gdp_distribution.png 142 KB 17 Jan 2026 14:23 <file> 02_gdp_boxplot_decades.png 98 KB 17 Jan 2026 14:23 <file> 03_gdp_trends.pdf 45 KB 17 Jan 2026 14:23 <file> 03_gdp_trends.png 156 KB 17 Jan 2026 14:23 <file> 04_gdp_scatter.png 287 KB 17 Jan 2026 14:23 <file> 05_gdp_density_decades.png 76 KB 17 Jan 2026 14:23 ─────────────────────────────────────────────────────────────────

Saved figure files:

figures/01_gdp_distribution.png
GDP Distribution Histogram
figures/02_gdp_boxplot_decades.png
GDP Boxplot by Decade
figures/03_gdp_trends.png
GDP Trends
figures/04_gdp_scatter.png
GDP Scatter
figures/05_gdp_density_decades.png
GDP Density by Decade
R Output (RStudio Console)
> dir.create("figures", showWarnings = FALSE) > ggsave("figures/01_gdp_distribution.png", p1, ...) Saved: figures/01_gdp_distribution.png > ggsave("figures/02_gdp_boxplot_decades.png", p2, ...) Saved: figures/02_gdp_boxplot_decades.png > ggsave("figures/03_gdp_trends.png", p3, ...) > ggsave("figures/03_gdp_trends.pdf", p3, ...) Saved: figures/03_gdp_trends.png and .pdf > ggsave("figures/04_gdp_scatter.png", p4, ...) Saved: figures/04_gdp_scatter.png > ggsave("figures/05_gdp_density_decades.png", p5, ...) Saved: figures/05_gdp_density_decades.png ─── All figures saved to ./figures/ ─── > list.files("figures") [1] "01_gdp_distribution.png" [2] "02_gdp_boxplot_decades.png" [3] "03_gdp_trends.pdf" [4] "03_gdp_trends.png" [5] "04_gdp_scatter.png" [6] "05_gdp_density_decades.png"

Saved figure files:

figures/01_gdp_distribution.png
GDP Distribution Histogram
figures/02_gdp_boxplot_decades.png
GDP Boxplot by Decade
figures/03_gdp_trends.png
GDP Trends
figures/04_gdp_scatter.png
GDP Scatter
figures/05_gdp_density_decades.png
GDP Density by Decade

Complete "First Analysis" Script

Use this script as guideline to propduce one you can run on every new dataset before any analysis:

Click to expand complete Python script
# ============================================
# FIRST ANALYSIS SCRIPT - You can run something like this on every new dataset!
# ============================================
import pandas as pd
import matplotlib.pyplot as plt

# 1. LOAD DATA - Replace URL with your data source
url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"
df = pd.read_csv(url)

print("="*50)
print("FIRST ANALYSIS REPORT")
print("="*50)

# 2. STRUCTURE - Understand dimensions and types
print(f"\nRows: {len(df):,} | Columns: {len(df.columns)}")
print(f"\nColumn types:\n{df.dtypes}")

# 3. MISSING VALUES - Critical for research quality
print("\n--- MISSING VALUES ---")
missing_pct = (df.isnull().sum() / len(df) * 100).round(1)
print(missing_pct[missing_pct > 0])

# 4. SUMMARY STATISTICS - Understand distributions
print("\n--- SUMMARY STATISTICS ---")
print(df.describe())

# 5. HISTOGRAMS - Visual inspection of all numeric columns
df.select_dtypes(include=['number']).hist(figsize=(12, 6), bins=30)
plt.tight_layout()
plt.savefig('first_analysis_histograms.png')
plt.show()

print("\n" + "="*50)
print("ANALYSIS COMPLETE")
print("="*50)