2a  Importing Data from Files

~3 hours · Topics: CSV, Excel, Stata, Merging, Reshaping · Level: Beginner

Learning Objectives

  • Load data from CSV, Excel, Stata, and other common formats
  • Understand different file encodings and handle them correctly
  • Merge multiple datasets using key variables
  • Reshape data between wide and long formats

Course Research Project

In this submodule, we'll load our first datasets for the "Climate Vulnerability and Economic Growth" project—World Bank GDP data and country indicators.

Common Data Formats in Economics Research

Before loading data, you need to understand what format it's in. Here are the most common formats you'll encounter:

CSV .csv

Comma-Separated Values. Plain text, universal compatibility.

Use: Sharing data, web downloads, exports

Excel .xlsx / .xls

Microsoft spreadsheet format. Can have multiple sheets.

Use: Business data, government reports

Stata .dta

Stata's native format. Preserves variable labels and types.

Use: Academic datasets, IPUMS, replication files

R Data .rds / .RData

R's native formats. .rds = single object, .RData = multiple objects.

Use: Saving R work, sharing with R users

JSON .json

JavaScript Object Notation. Hierarchical/nested data.

Use: API responses, web data, configuration

Fixed-width .txt / .dat

Columns defined by character positions. No delimiters.

Use: Census data, older government files

Tip: For a comprehensive guide on all file formats (data files, scripts, notebooks), see Module 0: File Formats Reference.
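Fixed-width files are the one format above with no loader shown later in this submodule, so here is a minimal sketch using pandas' read_fwf. The data and the column positions are hypothetical; for a real file you must take the (start, end) positions from its codebook.

```python
# Sketch: reading a fixed-width file with pandas.
# The column positions below are hypothetical -- check your file's codebook.
import io

import pandas as pd

# A tiny fixed-width extract: chars 0-9 = country, 10-13 = year, 15-19 = gdp
raw = io.StringIO(
    "USA       2020 21000\n"
    "CHN       2020 14700\n"
)

df = pd.read_fwf(
    raw,
    colspecs=[(0, 10), (10, 14), (15, 20)],  # (start, end) character positions
    names=["country", "year", "gdp"],        # the file has no header row
)
print(df)
```

The same idea scales to real census extracts: list one (start, end) pair per variable and name the columns yourself.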

Loading Data from Files

What Happens When You "Load" Data?

Loading data reads the file from disk into your program's memory as a structured object (DataFrame in Python/R, dataset in Stata). The original file remains unchanged—you're working with a copy. Any changes you make only exist in memory until you explicitly save them.
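The "you're working with a copy" point can be demonstrated directly. This sketch writes a small throwaway CSV (hypothetical data, just for the demo), edits the loaded DataFrame, then reloads the file to show it is unchanged:

```python
# Demonstration: editing a loaded DataFrame does not change the file on disk.
import os
import tempfile

import pandas as pd

# Write a small throwaway CSV (hypothetical data, just for this demo)
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
pd.DataFrame({"country": ["USA", "CHN"], "gdp": [21000, 14700]}).to_csv(path, index=False)

df = pd.read_csv(path)         # load a copy into memory
df["gdp"] = df["gdp"] * 1000   # change the in-memory copy only

reloaded = pd.read_csv(path)   # the file on disk is unchanged
print(reloaded["gdp"].tolist())  # still [21000, 14700]
print(df["gdp"].tolist())        # [21000000, 14700000]
```

Until you call something like to_csv() or save explicitly, nothing you do in memory touches the original file.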

Loading CSV Files

CSV is the most common format. Here's how to load it in each language:

# Python: Load CSV with pandas
import pandas as pd

# Basic CSV load from URL
df = pd.read_csv("https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv")

# Load from local file
df = pd.read_csv("data/my_data.csv")

# Handle common issues:
df = pd.read_csv("data.csv",
    encoding="utf-8",           # or "latin-1", "cp1252"
    sep=",",                      # delimiter (try ";" or "\t")
    na_values=["", "NA", ".", "-999"],  # missing value codes
    dtype={"id": str}             # force column types
)

# Verify the load
print(f"Loaded {len(df):,} rows, {len(df.columns)} columns")
print(df.head())
* Stata: Load CSV files

* Basic CSV load from URL
import delimited "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv", clear

* Load from local file
import delimited "data/my_data.csv", clear

* Handle common issues:
import delimited "data.csv", ///
    encoding(utf-8) ///           // or latin1, windows-1252
    delimiter(",") ///            // try ";" or tab
    varnames(1) ///               // first row = headers
    stringcols(1) ///              // force col 1 as string
    clear

* Verify the load
describe, short
list in 1/5
# R: Load CSV with tidyverse
library(tidyverse)

# Basic CSV load from URL
df <- read_csv("https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv")

# Load from local file
df <- read_csv("data/my_data.csv")

# Handle common issues:
df <- read_csv("data.csv",
    locale = locale(encoding = "UTF-8"),
    na = c("", "NA", ".", "-999"),
    col_types = cols(id = col_character())
)

# Verify the load
cat(sprintf("Loaded %s rows, %d columns\n",
    format(nrow(df), big.mark=","), ncol(df)))
head(df)
Python Output
Loaded 14,112 rows, 4 columns
  Country Name Country Code  Year         Value
0   Arab World          ARB  1968  2.576068e+10
1   Arab World          ARB  1969  2.843420e+10
2   Arab World          ARB  1970  3.138565e+10
3   Arab World          ARB  1971  3.587548e+10
4   Arab World          ARB  1972  4.250944e+10
Stata Output
. import delimited "https://...gdp.csv", clear
(encoding automatically selected: UTF-8)
(4 vars, 14,112 obs)

. describe, short

Contains data
 Observations:        14,112
    Variables:             4

. list in 1/5

     +--------------------------------------------------+
     | countryname   countrycode   year        value    |
     |--------------------------------------------------|
  1. |  Arab World           ARB   1968   2.5761e+10    |
  2. |  Arab World           ARB   1969   2.8434e+10    |
  3. |  Arab World           ARB   1970   3.1386e+10    |
  4. |  Arab World           ARB   1971   3.5875e+10    |
  5. |  Arab World           ARB   1972   4.2509e+10    |
     +--------------------------------------------------+
R Output
Rows: 14112 Columns: 4
── Column specification ────────────────────────────
Delimiter: ","
chr (2): Country Name, Country Code
dbl (2): Year, Value

Loaded 14,112 rows, 4 columns
# A tibble: 6 x 4
  `Country Name` `Country Code`  Year       Value
  <chr>          <chr>          <dbl>       <dbl>
1 Arab World     ARB             1968 25760680000
2 Arab World     ARB             1969 28434200000
3 Arab World     ARB             1970 31385650000
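When the encoding= argument shown above is wrong, pandas raises a UnicodeDecodeError. A common pattern is to try a short list of candidate encodings until one decodes cleanly; this sketch simulates a Latin-1 file in memory (hypothetical data) so it is self-contained:

```python
# Sketch: trying candidate encodings until one decodes cleanly.
# Useful when a CSV's encoding is unknown (common with older government files).
import io

import pandas as pd

# Simulate a Latin-1 encoded file containing non-ASCII characters
raw_bytes = "country,note\nCôte d'Ivoire,café\n".encode("latin-1")

df = None
for enc in ["utf-8", "latin-1", "cp1252"]:
    try:
        # A fresh buffer per attempt; with a real file, pass the path instead
        df = pd.read_csv(io.BytesIO(raw_bytes), encoding=enc)
        print(f"Decoded with {enc}")
        break
    except UnicodeDecodeError:
        continue
```

Here utf-8 fails on the Latin-1 bytes and the loop falls back to latin-1. A decode that "succeeds" can still be wrong (latin-1 accepts any byte), so always eyeball string columns afterwards.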

Loading Excel and Stata Files

# Python: Load Excel and Stata files
import pandas as pd

# Excel files
df = pd.read_excel("data.xlsx")                      # First sheet
df = pd.read_excel("data.xlsx", sheet_name="Sheet2") # Specific sheet
df = pd.read_excel("data.xlsx", sheet_name=0)        # By index

# Load all sheets into a dictionary
all_sheets = pd.read_excel("data.xlsx", sheet_name=None)

# Stata files (preserves labels!)
df = pd.read_stata("data.dta")

# For large Stata files, use iterator
for chunk in pd.read_stata("big_data.dta", chunksize=10000):
    # Process each chunk
    pass

# JSON files (common from APIs)
df = pd.read_json("data.json")
* Stata: Load different file formats

* Stata native format (.dta)
use "data.dta", clear

* Load from URL
use "https://example.com/data.dta", clear

* Load only specific variables
use country year gdp using "data.dta", clear

* Load with condition (faster than loading all then dropping)
use "data.dta" if year >= 2000, clear

* Excel files
import excel "data.xlsx", firstrow clear
import excel "data.xlsx", sheet("Sheet2") firstrow clear

* View available sheets
import excel "data.xlsx", describe
# R: Load different file formats
library(tidyverse)
library(readxl)      # For Excel files
library(haven)       # For Stata files

# Excel files
df <- read_excel("data.xlsx")                       # First sheet
df <- read_excel("data.xlsx", sheet = "Sheet2")     # Specific sheet

# List available sheets
excel_sheets("data.xlsx")

# Load all sheets into a list
sheets <- excel_sheets("data.xlsx")
all_data <- map(sheets, ~read_excel("data.xlsx", sheet = .x))
names(all_data) <- sheets

# Stata files (preserves labels!)
df <- read_dta("data.dta")

# R native formats
df <- readRDS("data.rds")       # Single object
load("workspace.RData")         # Multiple objects
Python Output
Excel file loaded successfully
Sheets found: ['Data', 'Metadata', 'Notes']

Stata file loaded successfully
Loaded 50,000 rows, 25 columns
Note: Variable labels preserved
Stata Output
. use "data.dta", clear
(World Development Indicators)

. describe, short

Contains data from data.dta
 Observations:        50,000
    Variables:            25

. import excel "data.xlsx", describe

Sheet name: "Data"       range: A1:Z1000
Sheet name: "Metadata"   range: A1:D200
Sheet name: "Notes"      range: A1:B50
R Output
> excel_sheets("data.xlsx")
[1] "Data"     "Metadata" "Notes"

> df <- read_dta("data.dta")
# A tibble: 50,000 x 25
Note: Variable labels accessible via attr(df$varname, "label")

Merging Datasets

What is Merging?

Merging (or joining) combines two datasets based on common key variables. For example, you might merge GDP data with population data using country code and year as keys. The key variables must uniquely identify observations in at least one of the datasets.

Types of Merges

Merge Example

GDP Data (Left)

country   gdp
USA       21000
CHN       14700
DEU       3800

Population Data (Right)

country   pop
USA       331
CHN       1400
JPN       126

Result: Inner Join

country   gdp     pop
USA       21000   331
CHN       14700   1400

Inner join: Only rows with matching keys in BOTH datasets are kept (2 rows).

Type          Keeps                                Use When
Inner         Only matching rows                   You only want complete cases
Left          All from left + matches from right   Keep all your main data, add info where available
Right         All from right + matches from left   Opposite of left join
Outer (Full)  All rows from both                   You want to preserve everything

Terminology Across Languages

Merging operations have different names in Python, Stata, and R. Use this table to translate:

Operation        Python (pandas)                Stata                  R (dplyr)
Inner Join       pd.merge(..., how='inner')     merge ..., keep(3)     inner_join()
Left Join        pd.merge(..., how='left')      merge ..., keep(1 3)   left_join()
Right Join       pd.merge(..., how='right')     merge ..., keep(2 3)   right_join()
Full/Outer Join  pd.merge(..., how='outer')     merge ... (default)    full_join()
Match type       validate='1:1', 'm:1', '1:m'   1:1, m:1, 1:m, m:m     relationship='one-to-one', etc.

Stata note: _merge values: 1 = master only, 2 = using only, 3 = matched in both.

# Python: Merging datasets with pandas
import pandas as pd

# Create example datasets
gdp = pd.DataFrame({
    'country': ['USA', 'CHN', 'DEU'],
    'year': [2020, 2020, 2020],
    'gdp': [21000, 14700, 3800]
})

pop = pd.DataFrame({
    'country': ['USA', 'CHN', 'JPN'],
    'year': [2020, 2020, 2020],
    'population': [331, 1400, 126]
})

# Inner join (only matches)
merged = pd.merge(gdp, pop, on=['country', 'year'])

# Left join (keep all from gdp)
merged_left = pd.merge(gdp, pop, on=['country', 'year'], how='left')

# Outer join (keep all)
merged_outer = pd.merge(gdp, pop, on=['country', 'year'], how='outer')

# When key columns have different names
merged = pd.merge(gdp, pop,
    left_on=['country', 'year'],
    right_on=['iso3', 'yr'])

# Check merge quality
print(f"GDP rows: {len(gdp)}, Pop rows: {len(pop)}, Merged: {len(merged)}")
* Stata: Merging datasets

* Load master dataset first
use "gdp.dta", clear

* Basic 1:1 merge (one observation per key in both)
merge 1:1 country year using "population.dta"

* Check merge results (ALWAYS DO THIS!)
tab _merge

* m:1 merge (many-to-one: many in master, one in using)
* Example: household data merged with region characteristics
merge m:1 region using "region_data.dta"

* 1:m merge (one-to-many)
merge 1:m country using "yearly_data.dta"

* Keep only matched observations
keep if _merge == 3
drop _merge

* Or assert all should match
assert _merge == 3
drop _merge
# R: Merging datasets with dplyr
library(tidyverse)

# Create example datasets
gdp <- tibble(
    country = c("USA", "CHN", "DEU"),
    year = c(2020, 2020, 2020),
    gdp = c(21000, 14700, 3800)
)

pop <- tibble(
    country = c("USA", "CHN", "JPN"),
    year = c(2020, 2020, 2020),
    population = c(331, 1400, 126)
)

# Inner join (only matches)
merged <- inner_join(gdp, pop, by = c("country", "year"))

# Left join (keep all from gdp)
merged_left <- left_join(gdp, pop, by = c("country", "year"))

# Full/outer join (keep all)
merged_full <- full_join(gdp, pop, by = c("country", "year"))

# When key columns have different names
merged <- left_join(gdp, pop, by = c("country" = "iso3", "year" = "yr"))

# Check merge quality
cat(sprintf("GDP: %d, Pop: %d, Merged: %d\n",
    nrow(gdp), nrow(pop), nrow(merged)))
Python Output
Inner join result:
  country  year    gdp  population
0     USA  2020  21000         331
1     CHN  2020  14700        1400

Left join result:
  country  year    gdp  population
0     USA  2020  21000       331.0
1     CHN  2020  14700      1400.0
2     DEU  2020   3800         NaN   <-- no match

GDP rows: 3, Pop rows: 3, Merged: 2
Stata Output
. merge 1:1 country year using "population.dta"

    Result                      Number of obs
    -----------------------------------------
    Not matched                             2
        from master                         1  (_merge==1)   <-- DEU
        from using                          1  (_merge==2)   <-- JPN
    Matched                                 2  (_merge==3)
    -----------------------------------------

. tab _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1       25.00       25.00
          2 |          1       25.00       50.00
          3 |          2       50.00      100.00
------------+-----------------------------------
      Total |          4      100.00
R Output
> merged  # Inner join
# A tibble: 2 x 4
  country  year   gdp population
  <chr>   <dbl> <dbl>      <dbl>
1 USA      2020 21000        331
2 CHN      2020 14700       1400

> merged_left  # Left join
# A tibble: 3 x 4
  country  year   gdp population
1 USA      2020 21000        331
2 CHN      2020 14700       1400
3 DEU      2020  3800         NA

GDP: 3, Pop: 3, Merged: 2
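Stata's tab _merge check has a direct pandas counterpart: passing indicator=True to pd.merge adds a _merge column, and validate= makes the merge fail loudly when the keys are not as unique as you expect. A minimal sketch with the same example data:

```python
# pandas equivalent of Stata's _merge check: indicator=True adds a _merge
# column ('left_only', 'right_only', 'both'); validate= raises MergeError
# if the keys don't have the expected uniqueness.
import pandas as pd

gdp = pd.DataFrame({"country": ["USA", "CHN", "DEU"], "gdp": [21000, 14700, 3800]})
pop = pd.DataFrame({"country": ["USA", "CHN", "JPN"], "population": [331, 1400, 126]})

merged = pd.merge(gdp, pop, on="country", how="outer",
                  indicator=True,    # adds the _merge column
                  validate="1:1")    # error if country is not unique on both sides

print(merged["_merge"].value_counts())
```

Always inspect _merge (or value_counts of the indicator column) after a merge, exactly as the Stata code above insists; silent mismatches are one of the most common sources of wrong results.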

Reshaping Data: Wide vs Long

Wide vs Long Format

Wide format: One row per unit, with separate columns for each time period (e.g., gdp_2018, gdp_2019, gdp_2020).

Long format: Multiple rows per unit, with a single column for the variable and another for time period. Most statistical analyses require long format.

Reshape Example

Wide Format

country   gdp_2018   gdp_2019   gdp_2020
USA       20500      21400      21000
CHN       13900      14300      14700

2 rows × 4 columns: each year is a separate column. Good for presentation and side-by-side comparison.

# Python: Reshaping with pandas
import pandas as pd

# Example wide data
wide = pd.DataFrame({
    'country': ['USA', 'CHN'],
    'gdp_2018': [20500, 13900],
    'gdp_2019': [21400, 14300],
    'gdp_2020': [21000, 14700]
})

# Wide to Long using melt()
long = pd.melt(wide,
    id_vars=['country'],           # Columns to keep
    var_name='year',               # Name for the new 'variable' column
    value_name='gdp'               # Name for the new 'value' column
)

# Clean up the year column (remove 'gdp_' prefix)
long['year'] = long['year'].str.replace('gdp_', '').astype(int)

# Long to Wide using pivot()
wide_again = long.pivot(
    index='country',               # Row identifier
    columns='year',                # Create columns from this
    values='gdp'                   # Values to fill cells
).reset_index()

print("Long format:")
print(long)
* Stata: Reshaping data

* Example: Start with wide data
* Variables: country, gdp2018, gdp2019, gdp2020

* Wide to Long
reshape long gdp, i(country) j(year)

* i() = identifier variables (rows)
* j() = new variable to hold the suffix (2018, 2019, 2020)
* 'gdp' = stub of the variables to reshape

* Long to Wide
reshape wide gdp, i(country) j(year)

* Verify the reshape
list in 1/6

* Multiple variables at once
* If you have gdp2018 gdp2019 pop2018 pop2019:
reshape long gdp pop, i(country) j(year)
# R: Reshaping with tidyr
library(tidyverse)

# Example wide data
wide <- tibble(
    country = c("USA", "CHN"),
    gdp_2018 = c(20500, 13900),
    gdp_2019 = c(21400, 14300),
    gdp_2020 = c(21000, 14700)
)

# Wide to Long using pivot_longer()
long <- wide %>%
  pivot_longer(
    cols = starts_with("gdp_"),   # Columns to pivot
    names_to = "year",             # New column for names
    values_to = "gdp",             # New column for values
    names_prefix = "gdp_"          # Remove this prefix
  ) %>%
  mutate(year = as.integer(year))

# Long to Wide using pivot_wider()
wide_again <- long %>%
  pivot_wider(
    names_from = year,             # Column to spread
    values_from = gdp,             # Values to fill
    names_prefix = "gdp_"          # Add prefix to column names
  )

print(long)
Python Output
Long format:
  country  year    gdp
0     USA  2018  20500
1     CHN  2018  13900
2     USA  2019  21400
3     CHN  2019  14300
4     USA  2020  21000
5     CHN  2020  14700
Stata Output
. reshape long gdp, i(country) j(year)
(j = 2018 2019 2020)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations                2   ->   6
Number of variables                   4   ->   3
j variable (3 values)                     ->   year
xij variables:
                gdp2018 gdp2019 gdp2020   ->   gdp
-----------------------------------------------------------------------------

. list

     +-----------------------+
     | country   year    gdp |
     |-----------------------|
  1. |     USA   2018  20500 |
  2. |     USA   2019  21400 |
  3. |     USA   2020  21000 |
  4. |     CHN   2018  13900 |
  5. |     CHN   2019  14300 |
  6. |     CHN   2020  14700 |
     +-----------------------+
R Output
> print(long)
# A tibble: 6 x 3
  country  year   gdp
  <chr>   <int> <dbl>
1 USA      2018 20500
2 USA      2019 21400
3 USA      2020 21000
4 CHN      2018 13900
5 CHN      2019 14300
6 CHN      2020 14700
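If you find Stata's stub-based reshape syntax more natural than melt(), pandas also offers pd.wide_to_long, which works the same way: name the stub, the id column(s), and the new j column. A sketch using the same example data:

```python
# pd.wide_to_long mirrors Stata's `reshape long gdp, i(country) j(year)`:
# name the stub ('gdp'), the id column(s), and the new j column.
import pandas as pd

wide = pd.DataFrame({
    "country": ["USA", "CHN"],
    "gdp_2018": [20500, 13900],
    "gdp_2019": [21400, 14300],
    "gdp_2020": [21000, 14700],
})

long = pd.wide_to_long(wide, stubnames="gdp", i="country", j="year",
                       sep="_")   # column names look like gdp_<year>
long = long.reset_index().sort_values(["country", "year"])
print(long)
```

Like Stata's reshape long gdp pop, wide_to_long accepts a list of stubnames to reshape several variable families in one call.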

Exercise

Exercise 2a.1: Data Import and Merging

Practice the core data import skills. Complete these tasks:

  1. Load a CSV file into a dataframe
  2. Perform a merge/join operation on two datasets
  3. Reshape data from wide to long format (or vice versa)