2  Data Harnessing

Duration: ~8 hours · Topics: Acquisition, APIs, Web Scraping · Level: Beginner-Intermediate

Learning Objectives

  • Load data from CSV, Excel, Stata, and Parquet formats
  • Merge and reshape datasets for analysis
  • Access data programmatically through APIs
  • Understand the legal and ethical framework for web scraping
  • Extract data from web pages using HTML parsing

Course Research Project

Throughout this course, we build a complete research project: "Climate Vulnerability and Economic Growth." This module focuses on acquiring the data we'll later explore and analyze.

What is Data Harnessing?

Data harnessing is the process of acquiring data from various sources and preparing it for analysis. This is the critical first step in any research project—before you can explore, clean, or analyze data, you need to get it.

This module covers three methods of data acquisition, ordered from simplest to most complex:

Three Paths to Data

Path A: File Import

Data already collected and stored in files

  • CSV, Excel, Stata .dta, and Parquet files
  • Survey datasets (IPUMS, etc.)
  • Published replication data

Skills: Loading, merging, reshaping

Path B: APIs

Programmatic access to online databases

  • World Bank, FRED, Census
  • Structured, official access
  • Real-time or updated data

Skills: HTTP requests, JSON, authentication

Path C: Web Scraping

Extract data from web pages (last resort)

  • When no API/download exists
  • Requires legal awareness
  • HTML parsing techniques

Skills: HTML, BeautifulSoup/rvest, ethics

Module Structure

This module is divided into three subpages, one for each data acquisition method:

2a: File Import

Start here. Learn to load, merge, and reshape data from common file formats (a short pandas sketch follows the list):

  • Loading CSV, Excel, Stata, and Parquet files
  • Merging datasets with different join types
  • Reshaping between wide and long formats
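A minimal pandas sketch of all three skills. The file names and the country/year columns are placeholders for illustration, not files distributed with the course:

```python
import pandas as pd

# Loading: pandas reads all of the formats above directly.
gdp = pd.read_csv("gdp.csv")                      # placeholder file names
codebook = pd.read_excel("codebook.xlsx")
survey = pd.read_stata("survey.dta")
climate = pd.read_parquet("climate.parquet")

# Merging: a left join keeps every gdp row, matched on country-year.
merged = gdp.merge(climate, on=["country", "year"], how="left")

# Reshaping: wide -> long, one row per country-year-indicator.
tidy = merged.melt(id_vars=["country", "year"],
                   var_name="indicator", value_name="value")
```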

2b: APIs

Fetch data programmatically from online databases (a request sketch follows the list):

  • Understanding what APIs are and how they work
  • Making HTTP requests to retrieve data
  • Handling JSON responses and authentication
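To make this concrete, here is a sketch against the World Bank API, which needs no key (FRED and Census require one, typically passed as a query parameter). The indicator code and country are examples; the two-part [metadata, data] response shape is the World Bank v2 convention:

```python
import requests

# GDP (current US$) for Kenya from the World Bank v2 API.
url = "https://api.worldbank.org/v2/country/KEN/indicator/NY.GDP.MKTP.CD"
resp = requests.get(url, params={"format": "json", "per_page": 100})
resp.raise_for_status()                  # stop on HTTP errors (4xx/5xx)

meta, records = resp.json()              # v2 returns [metadata, rows]
for row in records[:5]:
    print(row["date"], row["value"])     # year and GDP value
```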

2c: Web Scraping

Extract data from web pages when no other option exists (a parsing sketch follows the list):

  • Legal and ethical considerations (Terms of Service, robots.txt, copyright)
  • HTML basics and CSS selectors
  • BeautifulSoup (Python) and rvest (R) for parsing
  • Handling pagination and dynamic content
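A sketch of the polite-scraping workflow, assuming a hypothetical page with an HTML table (the URL, table id, and user-agent string are invented for illustration); the robots.txt check uses Python's standard-library robotparser:

```python
import requests
from bs4 import BeautifulSoup
from urllib import robotparser

BASE = "https://example.com"                  # placeholder site
AGENT = "climate-research-course"             # identify yourself honestly

# 1. Check robots.txt before fetching anything.
rp = robotparser.RobotFileParser(BASE + "/robots.txt")
rp.read()
if not rp.can_fetch(AGENT, BASE + "/stats"):
    raise SystemExit("robots.txt disallows this page")

# 2. Fetch and parse the page.
resp = requests.get(BASE + "/stats", headers={"User-Agent": AGENT})
soup = BeautifulSoup(resp.text, "html.parser")

# 3. CSS selector for the rows of an (assumed) <table id="data">.
for row in soup.select("table#data tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    print(cells)
```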

Which Subpage Should I Start With?

Your situation → Recommended path

  • "I have a .csv/.xlsx/.dta file to work with" → 2a: File Import
  • "I need World Bank / FRED / Census data" → 2b: APIs
  • "The data I need is only on a website" → 2c: Web Scraping (but check for an API first!)
  • "I'm new to programming" → 2a first, then 2b, then 2c

What About Exploring Data?

Once you've acquired your data, you'll want to understand it before cleaning or analyzing. Module 3: Data Exploration teaches you to build a "First Analysis" script that inspects, summarizes, and visualizes any dataset. This diagnostic step should happen before you start cleaning (Module 4) or formal analysis (Module 5).