2a  Importing Data from Files

~3 hours · Topics: CSV, Excel, Stata, Merging, Reshaping · Level: Beginner

Learning Objectives

  • Load data from CSV, Excel, Stata, and other common formats
  • Understand different file encodings and handle them correctly
  • Merge multiple datasets using key variables
  • Reshape data between wide and long formats

Course Research Project

In this submodule, we'll load our first datasets for the "Climate Vulnerability and Economic Growth" project—World Bank GDP data and country indicators.

Common Data Formats in Economics Research

Before loading data, you need to understand what format it's in. Here are the most common formats you'll encounter:

CSV .csv

Comma-Separated Values. Plain text, universal compatibility.

Use: Sharing data, web downloads, exports

Excel .xlsx / .xls

Microsoft spreadsheet format. Can have multiple sheets.

Use: Business data, government reports

Stata .dta

Stata's native format. Preserves variable labels and types.

Use: Academic datasets, IPUMS, replication files

R Data .rds / .RData

R's native formats. .rds = single object, .RData = multiple objects.

Use: Saving R work, sharing with R users

JSON .json

JavaScript Object Notation. Hierarchical/nested data.

Use: API responses, web data, configuration

Fixed-width .txt / .dat

Columns defined by character positions. No delimiters.

Use: Census data, older government files

Tip: For a comprehensive guide on all file formats (data files, scripts, notebooks), see Module 0: File Formats Reference.
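Fixed-width files are the one format above with no loader shown later in this submodule, so here is a minimal sketch using pandas' read_fwf. The data and the column positions are hypothetical; for a real file you must take the (start, end) positions from its codebook.

```python
# Sketch: reading a fixed-width file with pandas.
# The column positions below are hypothetical -- check your file's codebook.
import io

import pandas as pd

# A tiny fixed-width extract: chars 0-9 = country, 10-13 = year, 15-19 = gdp
raw = io.StringIO(
    "USA       2020 21000\n"
    "CHN       2020 14700\n"
)

df = pd.read_fwf(
    raw,
    colspecs=[(0, 10), (10, 14), (15, 20)],  # (start, end) character positions
    names=["country", "year", "gdp"],        # the file has no header row
)
print(df)
```

The same idea scales to real census extracts: list one (start, end) pair per variable and name the columns yourself.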

Loading Data from Files

What Happens When You "Load" Data?

Loading data reads the file from disk into your program's memory as a structured object (DataFrame in Python/R, dataset in Stata). The original file remains unchanged—you're working with a copy. Any changes you make only exist in memory until you explicitly save them.
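The "you're working with a copy" point can be demonstrated directly. This sketch writes a small throwaway CSV (hypothetical data, just for the demo), edits the loaded DataFrame, then reloads the file to show it is unchanged:

```python
# Demonstration: editing a loaded DataFrame does not change the file on disk.
import os
import tempfile

import pandas as pd

# Write a small throwaway CSV (hypothetical data, just for this demo)
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
pd.DataFrame({"country": ["USA", "CHN"], "gdp": [21000, 14700]}).to_csv(path, index=False)

df = pd.read_csv(path)         # load a copy into memory
df["gdp"] = df["gdp"] * 1000   # change the in-memory copy only

reloaded = pd.read_csv(path)   # the file on disk is unchanged
print(reloaded["gdp"].tolist())  # still [21000, 14700]
print(df["gdp"].tolist())        # [21000000, 14700000]
```

Until you call something like to_csv() or save explicitly, nothing you do in memory touches the original file.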

Loading CSV Files

CSV is the most common format. Here's how to load it in each language:

# Python: Load CSV with pandas
import pandas as pd

# Basic CSV load from URL
df = pd.read_csv("https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv")

# Load from local file
df = pd.read_csv("data/my_data.csv")

# Handle common issues:
df = pd.read_csv("data.csv",
    encoding="utf-8",           # or "latin-1", "cp1252"
    sep=",",                      # delimiter (try ";" or "\t")
    na_values=["", "NA", ".", "-999"],  # missing value codes
    dtype={"id": str}             # force column types
)

# Verify the load
print(f"Loaded {len(df):,} rows, {len(df.columns)} columns")
print(df.head())
* Stata: Load CSV files

* Basic CSV load from URL
import delimited "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv", clear

* Load from local file
import delimited "data/my_data.csv", clear

* Handle common issues:
import delimited "data.csv", ///
    encoding(utf-8) ///           // or latin1, windows-1252
    delimiter(",") ///            // try ";" or tab
    varnames(1) ///               // first row = headers
    stringcols(1) ///              // force col 1 as string
    clear

* Verify the load
describe, short
list in 1/5
# R: Load CSV with tidyverse
library(tidyverse)

# Basic CSV load from URL
df <- read_csv("https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv")

# Load from local file
df <- read_csv("data/my_data.csv")

# Handle common issues:
df <- read_csv("data.csv",
    locale = locale(encoding = "UTF-8"),
    na = c("", "NA", ".", "-999"),
    col_types = cols(id = col_character())
)

# Verify the load
cat(sprintf("Loaded %s rows, %d columns\n",
    format(nrow(df), big.mark=","), ncol(df)))
head(df)
Python Output
Loaded 14,112 rows, 4 columns
  Country Name Country Code  Year         Value
0   Arab World          ARB  1968  2.576068e+10
1   Arab World          ARB  1969  2.843420e+10
2   Arab World          ARB  1970  3.138565e+10
3   Arab World          ARB  1971  3.587548e+10
4   Arab World          ARB  1972  4.250944e+10
Stata Output
. import delimited "https://...gdp.csv", clear
(encoding automatically selected: UTF-8)
(4 vars, 14,112 obs)

. describe, short

Contains data
 Observations:        14,112
    Variables:             4

. list in 1/5

     +--------------------------------------------------+
     | countryname   countrycode   year        value    |
     |--------------------------------------------------|
  1. |  Arab World           ARB   1968   2.5761e+10    |
  2. |  Arab World           ARB   1969   2.8434e+10    |
  3. |  Arab World           ARB   1970   3.1386e+10    |
  4. |  Arab World           ARB   1971   3.5875e+10    |
  5. |  Arab World           ARB   1972   4.2509e+10    |
     +--------------------------------------------------+
R Output
Rows: 14112 Columns: 4
── Column specification ────────────────────────────
Delimiter: ","
chr (2): Country Name, Country Code
dbl (2): Year, Value

Loaded 14,112 rows, 4 columns
# A tibble: 6 x 4
  `Country Name` `Country Code`  Year       Value
  <chr>          <chr>          <dbl>       <dbl>
1 Arab World     ARB             1968 25760680000
2 Arab World     ARB             1969 28434200000
3 Arab World     ARB             1970 31385650000
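When the encoding= argument shown above is wrong, pandas raises a UnicodeDecodeError. A common pattern is to try a short list of candidate encodings until one decodes cleanly; this sketch simulates a Latin-1 file in memory (hypothetical data) so it is self-contained:

```python
# Sketch: trying candidate encodings until one decodes cleanly.
# Useful when a CSV's encoding is unknown (common with older government files).
import io

import pandas as pd

# Simulate a Latin-1 encoded file containing non-ASCII characters
raw_bytes = "country,note\nCôte d'Ivoire,café\n".encode("latin-1")

df = None
for enc in ["utf-8", "latin-1", "cp1252"]:
    try:
        # A fresh buffer per attempt; with a real file, pass the path instead
        df = pd.read_csv(io.BytesIO(raw_bytes), encoding=enc)
        print(f"Decoded with {enc}")
        break
    except UnicodeDecodeError:
        continue
```

Here utf-8 fails on the Latin-1 bytes and the loop falls back to latin-1. A decode that "succeeds" can still be wrong (latin-1 accepts any byte), so always eyeball string columns afterwards.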

Loading Excel and Stata Files

# Python: Load Excel and Stata files
import pandas as pd

# Excel files
df = pd.read_excel("data.xlsx")                      # First sheet
df = pd.read_excel("data.xlsx", sheet_name="Sheet2") # Specific sheet
df = pd.read_excel("data.xlsx", sheet_name=0)        # By index

# Load all sheets into a dictionary
all_sheets = pd.read_excel("data.xlsx", sheet_name=None)

# Stata files (preserves labels!)
df = pd.read_stata("data.dta")

# For large Stata files, use iterator
for chunk in pd.read_stata("big_data.dta", chunksize=10000):
    # Process each chunk
    pass

# JSON files (common from APIs)
df = pd.read_json("data.json")
* Stata: Load different file formats

* Stata native format (.dta)
use "data.dta", clear

* Load from URL
use "https://example.com/data.dta", clear

* Load only specific variables
use country year gdp using "data.dta", clear

* Load with condition (faster than loading all then dropping)
use "data.dta" if year >= 2000, clear

* Excel files
import excel "data.xlsx", firstrow clear
import excel "data.xlsx", sheet("Sheet2") firstrow clear

* View available sheets
import excel "data.xlsx", describe
# R: Load different file formats
library(tidyverse)
library(readxl)      # For Excel files
library(haven)       # For Stata files

# Excel files
df <- read_excel("data.xlsx")                       # First sheet
df <- read_excel("data.xlsx", sheet = "Sheet2")     # Specific sheet

# List available sheets
excel_sheets("data.xlsx")

# Load all sheets into a list
sheets <- excel_sheets("data.xlsx")
all_data <- map(sheets, ~read_excel("data.xlsx", sheet = .x))
names(all_data) <- sheets

# Stata files (preserves labels!)
df <- read_dta("data.dta")

# R native formats
df <- readRDS("data.rds")       # Single object
load("workspace.RData")         # Multiple objects
Python Output
Excel file loaded successfully
Sheets found: ['Data', 'Metadata', 'Notes']

Stata file loaded successfully
Loaded 50,000 rows, 25 columns
Note: Variable labels preserved
Stata Output
. use "data.dta", clear
(World Development Indicators)

. describe, short

Contains data from data.dta
 Observations:        50,000
    Variables:            25

. import excel "data.xlsx", describe

Sheet name: "Data"       range: A1:Z1000
Sheet name: "Metadata"   range: A1:D200
Sheet name: "Notes"      range: A1:B50
R Output
> excel_sheets("data.xlsx")
[1] "Data"     "Metadata" "Notes"

> df <- read_dta("data.dta")
# A tibble: 50,000 x 25
Note: Variable labels accessible via attr(df$varname, "label")

Merging Datasets

What is Merging?

Merging (or joining) combines two datasets based on common key variables. For example, you might merge GDP data with population data using country code and year as keys. The key variables must uniquely identify observations in at least one of the datasets.

Types of Merges

Merge Example

GDP Data (Left)

country   gdp
USA       21000
CHN       14700
DEU       3800

Population Data (Right)

country   pop
USA       331
CHN       1400
JPN       126

Result: Inner Join

country   gdp     pop
USA       21000   331
CHN       14700   1400

Inner join: Only rows with matching keys in BOTH datasets are kept (2 rows).

Type          Keeps                                Use When
Inner         Only matching rows                   You only want complete cases
Left          All from left + matches from right   Keep all your main data, add info where available
Right         All from right + matches from left   Opposite of left join
Outer (Full)  All rows from both                   You want to preserve everything

Terminology Across Languages

Merging operations have different names in Python, Stata, and R. Use this table to translate:

Operation        Python (pandas)                Stata                  R (dplyr)
Inner Join       pd.merge(..., how='inner')     merge ..., keep(3)     inner_join()
Left Join        pd.merge(..., how='left')      merge ..., keep(1 3)   left_join()
Right Join       pd.merge(..., how='right')     merge ..., keep(2 3)   right_join()
Full/Outer Join  pd.merge(..., how='outer')     merge ... (default)    full_join()
Match type       validate='1:1', 'm:1', '1:m'   1:1, m:1, 1:m, m:m     relationship='one-to-one', etc.

Stata note: _merge values: 1 = master only, 2 = using only, 3 = matched in both.

# Python: Merging datasets with pandas
import pandas as pd

# Create example datasets
gdp = pd.DataFrame({
    'country': ['USA', 'CHN', 'DEU'],
    'year': [2020, 2020, 2020],
    'gdp': [21000, 14700, 3800]
})

pop = pd.DataFrame({
    'country': ['USA', 'CHN', 'JPN'],
    'year': [2020, 2020, 2020],
    'population': [331, 1400, 126]
})

# Inner join (only matches)
merged = pd.merge(gdp, pop, on=['country', 'year'])

# Left join (keep all from gdp)
merged_left = pd.merge(gdp, pop, on=['country', 'year'], how='left')

# Outer join (keep all)
merged_outer = pd.merge(gdp, pop, on=['country', 'year'], how='outer')

# When key columns have different names
merged = pd.merge(gdp, pop,
    left_on=['country', 'year'],
    right_on=['iso3', 'yr'])

# Check merge quality
print(f"GDP rows: {len(gdp)}, Pop rows: {len(pop)}, Merged: {len(merged)}")
* Stata: Merging datasets

* Load master dataset first
use "gdp.dta", clear

* Basic 1:1 merge (one observation per key in both)
merge 1:1 country year using "population.dta"

* Check merge results (ALWAYS DO THIS!)
tab _merge

* m:1 merge (many-to-one: many in master, one in using)
* Example: household data merged with region characteristics
merge m:1 region using "region_data.dta"

* 1:m merge (one-to-many)
merge 1:m country using "yearly_data.dta"

* Keep only matched observations
keep if _merge == 3
drop _merge

* Or assert all should match
assert _merge == 3
drop _merge
# R: Merging datasets with dplyr
library(tidyverse)

# Create example datasets
gdp <- tibble(
    country = c("USA", "CHN", "DEU"),
    year = c(2020, 2020, 2020),
    gdp = c(21000, 14700, 3800)
)

pop <- tibble(
    country = c("USA", "CHN", "JPN"),
    year = c(2020, 2020, 2020),
    population = c(331, 1400, 126)
)

# Inner join (only matches)
merged <- inner_join(gdp, pop, by = c("country", "year"))

# Left join (keep all from gdp)
merged_left <- left_join(gdp, pop, by = c("country", "year"))

# Full/outer join (keep all)
merged_full <- full_join(gdp, pop, by = c("country", "year"))

# When key columns have different names
merged <- left_join(gdp, pop, by = c("country" = "iso3", "year" = "yr"))

# Check merge quality
cat(sprintf("GDP: %d, Pop: %d, Merged: %d\n",
    nrow(gdp), nrow(pop), nrow(merged)))
Python Output
Inner join result:
  country  year    gdp  population
0     USA  2020  21000         331
1     CHN  2020  14700        1400

Left join result:
  country  year    gdp  population
0     USA  2020  21000       331.0
1     CHN  2020  14700      1400.0
2     DEU  2020   3800         NaN   <-- no match

GDP rows: 3, Pop rows: 3, Merged: 2
Stata Output
. merge 1:1 country year using "population.dta"

    Result                      Number of obs
    -----------------------------------------
    Not matched                             2
        from master                         1  (_merge==1)   <-- DEU
        from using                          1  (_merge==2)   <-- JPN
    Matched                                 2  (_merge==3)
    -----------------------------------------

. tab _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1       25.00       25.00
          2 |          1       25.00       50.00
          3 |          2       50.00      100.00
------------+-----------------------------------
      Total |          4      100.00
R Output
> merged  # Inner join
# A tibble: 2 x 4
  country  year   gdp population
  <chr>   <dbl> <dbl>      <dbl>
1 USA      2020 21000        331
2 CHN      2020 14700       1400

> merged_left  # Left join
# A tibble: 3 x 4
  country  year   gdp population
1 USA      2020 21000        331
2 CHN      2020 14700       1400
3 DEU      2020  3800         NA

GDP: 3, Pop: 3, Merged: 2
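Stata's tab _merge check has a direct pandas counterpart: passing indicator=True to pd.merge adds a _merge column, and validate= makes the merge fail loudly when the keys are not as unique as you expect. A minimal sketch with the same example data:

```python
# pandas equivalent of Stata's _merge check: indicator=True adds a _merge
# column ('left_only', 'right_only', 'both'); validate= raises MergeError
# if the keys don't have the expected uniqueness.
import pandas as pd

gdp = pd.DataFrame({"country": ["USA", "CHN", "DEU"], "gdp": [21000, 14700, 3800]})
pop = pd.DataFrame({"country": ["USA", "CHN", "JPN"], "population": [331, 1400, 126]})

merged = pd.merge(gdp, pop, on="country", how="outer",
                  indicator=True,    # adds the _merge column
                  validate="1:1")    # error if country is not unique on both sides

print(merged["_merge"].value_counts())
```

Always inspect _merge (or value_counts of the indicator column) after a merge, exactly as the Stata code above insists; silent mismatches are one of the most common sources of wrong results.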

Reshaping Data: Wide vs Long

Wide vs Long Format

Wide format: One row per unit, with separate columns for each time period (e.g., gdp_2018, gdp_2019, gdp_2020).

Long format: Multiple rows per unit, with a single column for the variable and another for time period. Most statistical analyses require long format.

Reshape Example

Wide Format

country   gdp_2018   gdp_2019   gdp_2020
USA       20500      21400      21000
CHN       13900      14300      14700

2 rows × 4 columns: each year is a separate column. Good for presentation and side-by-side comparison.

# Python: Reshaping with pandas
import pandas as pd

# Example wide data
wide = pd.DataFrame({
    'country': ['USA', 'CHN'],
    'gdp_2018': [20500, 13900],
    'gdp_2019': [21400, 14300],
    'gdp_2020': [21000, 14700]
})

# Wide to Long using melt()
long = pd.melt(wide,
    id_vars=['country'],           # Columns to keep
    var_name='year',               # Name for the new 'variable' column
    value_name='gdp'               # Name for the new 'value' column
)

# Clean up the year column (remove 'gdp_' prefix)
long['year'] = long['year'].str.replace('gdp_', '').astype(int)

# Long to Wide using pivot()
wide_again = long.pivot(
    index='country',               # Row identifier
    columns='year',                # Create columns from this
    values='gdp'                   # Values to fill cells
).reset_index()

print("Long format:")
print(long)
* Stata: Reshaping data

* Example: Start with wide data
* Variables: country, gdp2018, gdp2019, gdp2020

* Wide to Long
reshape long gdp, i(country) j(year)

* i() = identifier variables (rows)
* j() = new variable to hold the suffix (2018, 2019, 2020)
* 'gdp' = stub of the variables to reshape

* Long to Wide
reshape wide gdp, i(country) j(year)

* Verify the reshape
list in 1/6

* Multiple variables at once
* If you have gdp2018 gdp2019 pop2018 pop2019:
reshape long gdp pop, i(country) j(year)
# R: Reshaping with tidyr
library(tidyverse)

# Example wide data
wide <- tibble(
    country = c("USA", "CHN"),
    gdp_2018 = c(20500, 13900),
    gdp_2019 = c(21400, 14300),
    gdp_2020 = c(21000, 14700)
)

# Wide to Long using pivot_longer()
long <- wide %>%
  pivot_longer(
    cols = starts_with("gdp_"),   # Columns to pivot
    names_to = "year",             # New column for names
    values_to = "gdp",             # New column for values
    names_prefix = "gdp_"          # Remove this prefix
  ) %>%
  mutate(year = as.integer(year))

# Long to Wide using pivot_wider()
wide_again <- long %>%
  pivot_wider(
    names_from = year,             # Column to spread
    values_from = gdp,             # Values to fill
    names_prefix = "gdp_"          # Add prefix to column names
  )

print(long)
Python Output
Long format:
  country  year    gdp
0     USA  2018  20500
1     CHN  2018  13900
2     USA  2019  21400
3     CHN  2019  14300
4     USA  2020  21000
5     CHN  2020  14700
Stata Output
. reshape long gdp, i(country) j(year)
(j = 2018 2019 2020)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations                2   ->   6
Number of variables                   4   ->   3
j variable (3 values)                     ->   year
xij variables:
                gdp2018 gdp2019 gdp2020   ->   gdp
-----------------------------------------------------------------------------

. list

     +-----------------------+
     | country   year    gdp |
     |-----------------------|
  1. |     USA   2018  20500 |
  2. |     USA   2019  21400 |
  3. |     USA   2020  21000 |
  4. |     CHN   2018  13900 |
  5. |     CHN   2019  14300 |
  6. |     CHN   2020  14700 |
     +-----------------------+
R Output
> print(long)
# A tibble: 6 x 3
  country  year   gdp
  <chr>   <int> <dbl>
1 USA      2018 20500
2 USA      2019 21400
3 USA      2020 21000
4 CHN      2018 13900
5 CHN      2019 14300
6 CHN      2020 14700
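If you find Stata's stub-based reshape syntax more natural than melt(), pandas also offers pd.wide_to_long, which works the same way: name the stub, the id column(s), and the new j column. A sketch using the same example data:

```python
# pd.wide_to_long mirrors Stata's `reshape long gdp, i(country) j(year)`:
# name the stub ('gdp'), the id column(s), and the new j column.
import pandas as pd

wide = pd.DataFrame({
    "country": ["USA", "CHN"],
    "gdp_2018": [20500, 13900],
    "gdp_2019": [21400, 14300],
    "gdp_2020": [21000, 14700],
})

long = pd.wide_to_long(wide, stubnames="gdp", i="country", j="year",
                       sep="_")   # column names look like gdp_<year>
long = long.reset_index().sort_values(["country", "year"])
print(long)
```

Like Stata's reshape long gdp pop, wide_to_long accepts a list of stubnames to reshape several variable families in one call.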

Exercise

Exercise 2a.1: Data Import and Merging

Practice the core data import skills. Complete these tasks:

  1. Load a CSV file into a dataframe
  2. Perform a merge/join operation on two datasets
  3. Reshape data from wide to long format (or vice versa)