4  Data Cleaning

~6 hours · Missing Data, Outliers, Validation · Intermediate

Learning Objectives

  • Identify and handle missing values appropriately
  • Detect and treat outliers
  • Clean and manipulate string variables
  • Work with dates and time series data
  • Validate data quality and consistency

4.1 Handling Missing Data

Missing data is ubiquitous in real-world datasets. Understanding why data is missing is crucial for choosing the appropriate handling strategy. There are three main types of missingness:

Type   Description                                    Example
MCAR   Missing Completely At Random                   Survey responses lost due to computer glitch
MAR    Missing At Random (conditional on observed)    Income missing more often for older respondents
MNAR   Missing Not At Random                          High earners refuse to report income
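
The distinction matters in practice: under MCAR, dropping incomplete rows costs only precision, while under MAR and MNAR it can bias estimates. Below is a minimal Python sketch of how the three mechanisms could be simulated on made-up data; all variable names, sample sizes, and probabilities are illustrative, not taken from any real dataset.

# Python: simulating the three missingness mechanisms (illustrative only)
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
sim = pd.DataFrame({
    'age': rng.integers(25, 65, n),
    'income': rng.lognormal(mean=10.5, sigma=0.5, size=n)
})

# MCAR: every value has the same 10% chance of being missing
mcar = sim['income'].mask(rng.random(n) < 0.10)

# MAR: the chance of being missing depends on an observed variable (age)
mar = sim['income'].mask(rng.random(n) < 0.05 + 0.004 * (sim['age'] - 25))

# MNAR: the chance of being missing depends on the unobserved value itself
mnar = sim['income'].mask(rng.random(n) < 0.20 * sim['income'].rank(pct=True))

# Share missing under each mechanism
print(pd.DataFrame({'MCAR': mcar, 'MAR': mar, 'MNAR': mnar}).isna().mean())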

The code below shows how to analyze missingness patterns and handle missing data appropriately in Python, Stata, and R:

# Python: Handling missing data
import pandas as pd
import numpy as np

# Check for missing values
print(df.isnull().sum())           # Count by column
print(df.isnull().mean() * 100)   # Percentage missing

# Visualize missing patterns
import missingno as msno
msno.matrix(df)
msno.heatmap(df)  # Correlation of missingness

# Drop missing values
df_complete = df.dropna()                    # Drop any row with NA
df_subset = df.dropna(subset=['income'])    # Drop if income is missing

# Fill missing values (alternative strategies; pick one)
df['income'] = df['income'].fillna(df['income'].mean())    # Mean imputation
df['income'] = df['income'].fillna(df['income'].median())  # Median imputation
df['income'] = df['income'].ffill()                        # Forward fill

# Group-wise imputation
df['income'] = df.groupby('education')['income'].transform(
    lambda x: x.fillna(x.median())
)

# Create missing indicator
df['income_missing'] = df['income'].isnull().astype(int)
* Stata: Handling missing data

* Check for missing values
misstable summarize          // Summary of missing patterns
misstable patterns           // Missing patterns
mdesc                        // Requires: ssc install mdesc

* Count missing by variable
foreach var of varlist * {
    count if missing(`var')
    display "`var': " r(N) " missing"
}

* Drop observations with missing values
drop if missing(income)      // Drop if income missing
keep if !missing(income)     // Alternative

* Replace missing with mean
summarize income
replace income = r(mean) if missing(income)

* Group-wise imputation
bysort education: egen income_med = median(income)
replace income = income_med if missing(income)

* Create missing indicator
gen income_missing = missing(income)

* Multiple imputation (advanced)
mi set mlong
mi register imputed income
mi impute regress income age education, add(20)
# R: Handling missing data
library(tidyverse)
library(naniar)    # For missing data visualization

# Check for missing values
sum(is.na(df))                    # Total missing
colSums(is.na(df))                # By column
colMeans(is.na(df)) * 100        # Percentage

# Visualize missing patterns (naniar)
vis_miss(df)
gg_miss_var(df)
gg_miss_upset(df)

# Drop missing values
df_complete <- df %>% drop_na()              # All complete cases
df_subset <- df %>% drop_na(income)        # Drop if income missing

# Replace missing values
df <- df %>%
  mutate(
    income = replace_na(income, mean(income, na.rm = TRUE))
  )

# Group-wise imputation
df <- df %>%
  group_by(education) %>%
  mutate(
    income = ifelse(is.na(income), median(income, na.rm = TRUE), income)
  ) %>%
  ungroup()

# Create missing indicator
df <- df %>%
  mutate(income_missing = as.integer(is.na(income)))
Python Output
# Count missing by column
id             0
name           3
income       147
education     23
age            0
dtype: int64

# Percentage missing
id            0.00
name          0.30
income       14.70
education     2.30
age           0.00
dtype: float64
Stata Output
. misstable summarize
                                                              Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<. | values        Min         Max
  -------------+--------------------------------+------------------------------
        income |       147         0       853 |    142          .           .
     education |        23         0       977 |     15          .           .
          name |         3         0       997 |      .          .           .

. mi impute regress income age education, add(20)

Univariate imputation                         Imputations =       20
Linear regression                                   added =       20
Imputed: m=1 through m=20                         updated =        0

                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
  -----------------+-----------------------------------+----------
            income |        853          147       147 |      1000
R Output
> sum(is.na(df))
[1] 173
> colSums(is.na(df))
       id      name    income education       age
        0         3       147        23         0
> colMeans(is.na(df)) * 100
       id      name    income education       age
      0.0       0.3      14.7       2.3       0.0
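
The Stata example above ends with multiple imputation via mi impute. Python has no direct built-in equivalent, but scikit-learn's IterativeImputer is one commonly used model-based alternative. The sketch below assumes scikit-learn is installed, that df is the DataFrame used above, and that the listed numeric columns exist; note it produces a single completed dataset rather than multiple imputations.

# Python: model-based imputation with scikit-learn (sketch, not full multiple imputation)
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the estimator)
from sklearn.impute import IterativeImputer

numeric_cols = ['age', 'education', 'income']          # assumed column names
imputer = IterativeImputer(max_iter=10, random_state=0)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])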

4.2 Detecting and Treating Outliers

Outliers can significantly affect your analysis. Before removing them, always investigate why they exist—they might be data errors, or they might be genuine extreme values that contain important information. The key question is whether an observation represents a valid data point or a measurement error.

A common approach is the z-score method, which measures how many standard deviations an observation is from the mean. Values with |z| > 3 are typically flagged as potential outliers, as they fall outside 99.7% of a normal distribution (Iglewicz & Hoaglin, 1993). For non-normal data, the interquartile range (IQR) method is more robust: values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are flagged (Tukey, 1977).

# Python: Detecting and treating outliers
import pandas as pd
import numpy as np
from scipy import stats

# Method 1: Z-score (for approximately normal data)
z_scores = np.abs(stats.zscore(df['income'], nan_policy='omit'))  # ignore NAs when standardizing
df_no_outliers = df[z_scores < 3]
print(f"Removed {len(df) - len(df_no_outliers)} outliers using z-score method")

# Method 2: IQR (more robust, works for non-normal data)
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_iqr_clean = df[(df['income'] >= lower_bound) & (df['income'] <= upper_bound)]

# Winsorization: Cap extreme values instead of removing
df['income_winsorized'] = df['income'].clip(lower=lower_bound, upper=upper_bound)

# Create outlier indicator for analysis
df['is_outlier'] = ((df['income'] < lower_bound) | (df['income'] > upper_bound)).astype(int)
print(f"Outliers flagged: {df['is_outlier'].sum()}")
* Stata: Detecting and treating outliers

* Method 1: Z-score (for approximately normal data)
summarize income
gen z_income = (income - r(mean)) / r(sd)
gen is_outlier_z = (abs(z_income) > 3) & !missing(z_income)   // missing sorts as +infinity, so exclude it
count if is_outlier_z == 1

* Method 2: IQR (more robust, works for non-normal data)
summarize income, detail
scalar Q1 = r(p25)
scalar Q3 = r(p75)
scalar IQR = Q3 - Q1
scalar lower = Q1 - 1.5 * IQR
scalar upper = Q3 + 1.5 * IQR
gen is_outlier_iqr = (income < scalar(lower) | income > scalar(upper)) & !missing(income)

* Winsorization: Cap extreme values instead of removing
gen income_winsorized = income
replace income_winsorized = scalar(lower) if income < scalar(lower)
replace income_winsorized = scalar(upper) if income > scalar(upper) & !missing(income)

* Drop outliers (use with caution)
* drop if is_outlier_iqr == 1  // Uncomment to actually drop

* Tabulate outlier counts
tab is_outlier_iqr
# R: Detecting and treating outliers
library(tidyverse)

# Method 1: Z-score (for approximately normal data)
df <- df %>%
  mutate(z_income = as.numeric(scale(income)))   # as.numeric() drops the matrix attributes scale() returns
df_no_outliers <- df %>%
  filter(abs(z_income) < 3)
cat("Removed", nrow(df) - nrow(df_no_outliers), "outliers using z-score method\n")

# Method 2: IQR (more robust, works for non-normal data)
Q1 <- quantile(df$income, 0.25, na.rm = TRUE)
Q3 <- quantile(df$income, 0.75, na.rm = TRUE)
IQR_val <- IQR(df$income, na.rm = TRUE)
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
df_iqr_clean <- df %>%
  filter(between(income, lower_bound, upper_bound))

# Winsorization: Cap extreme values instead of removing
df <- df %>%
  mutate(income_winsorized = pmin(pmax(income, lower_bound), upper_bound))

# Create outlier indicator for analysis
df <- df %>%
  mutate(is_outlier = as.integer(income < lower_bound | income > upper_bound))
cat("Outliers flagged:", sum(df$is_outlier), "\n")
Python Output
Removed 12 outliers using z-score method
Outliers flagged: 47
Stata Output
. count if is_outlier_z == 1
  12

. tab is_outlier_iqr

is_outlier_ |
        iqr |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        953       95.30       95.30
          1 |         47        4.70      100.00
------------+-----------------------------------
      Total |      1,000      100.00
R Output
Removed 12 outliers using z-score method
Outliers flagged: 47
When to Keep vs. Remove Outliers

Keep outliers when they represent genuine extreme values (e.g., Bill Gates in an income study) or when your research question specifically concerns extremes. Remove or cap outliers when they are data entry errors, measurement failures, or when robust estimates are needed. Always report how you handled outliers and test whether your conclusions change.
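
One concrete way to "test whether your conclusions change" is to compute the statistics of interest on the raw variable and again after winsorizing at conventional percentiles. Below is a minimal Python sketch, assuming the same df and income column as above; the 1st/99th percentile cut-offs are a common convention, not a rule.

# Python: sensitivity check -- compare raw vs. winsorized income (illustrative)
import pandas as pd

p01, p99 = df['income'].quantile([0.01, 0.99])        # percentile-based cut-offs
income_w = df['income'].clip(lower=p01, upper=p99)    # winsorize at the 1st/99th percentiles

comparison = pd.DataFrame({
    'raw': df['income'].describe(),
    'winsorized': income_w.describe()
})
print(comparison.round(2))   # report both so readers can see how much the treatment matters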

4.3 String Manipulation

Text data often requires cleaning—standardizing case, removing whitespace, extracting patterns, and more. Economic datasets frequently contain messy string variables: inconsistent country names, occupation codes with extra characters, or addresses that need parsing.

# Python: String manipulation
import pandas as pd

# Case standardization
df['country'] = df['country'].str.lower()
df['state_code'] = df['state_code'].str.upper()
df['name'] = df['name'].str.title()

# Remove whitespace
df['state'] = df['state'].str.strip()
df['address'] = df['address'].str.replace(r'\s+', ' ', regex=True)

# Extract patterns with regex
df['zip_code'] = df['address'].str.extract(r'(\d{5})')
df[['state_abbr', 'zip']] = df['address'].str.extract(r'([A-Z]{2})\s+(\d{5})')

# Replace values for standardization
df['country'] = df['country'].replace({
    'usa': 'united states',
    'u.s.a': 'united states',
    'uk': 'united kingdom'
})

# Split strings into multiple columns
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)

# Check for patterns
df['is_corporation'] = df['company'].str.contains(r'inc|corp', case=False, regex=True)
* Stata: String manipulation

* Case standardization
replace country = lower(country)
replace state_code = upper(state_code)
replace name = proper(name)

* Remove whitespace
replace state = strtrim(state)
replace address = stritrim(address)

* Extract patterns with regex (ustrregexm supports \s, \d, and {n} quantifiers)
gen zip_code = ustrregexs(0) if ustrregexm(address, "[0-9]{5}")

* Extract with capture groups
gen state_abbr = ustrregexs(1) if ustrregexm(address, "([A-Z]{2})\s+[0-9]{5}")

* Replace values for standardization
replace country = "united states" if inlist(country, "usa", "u.s.a", "u.s.")
replace country = "united kingdom" if country == "uk"

* Split strings into multiple variables
split full_name, parse(" ") gen(name_part)
rename name_part1 first_name
rename name_part2 last_name

* Check for patterns
gen is_corporation = regexm(lower(company), "inc|corp")

* Substring extraction
gen state_abbr2 = substr(state_full, 1, 2)
# R: String manipulation
library(tidyverse)  # includes stringr

# Case standardization
df <- df %>%
  mutate(country = str_to_lower(country))
df <- df %>%
  mutate(state_code = str_to_upper(state_code))
df <- df %>%
  mutate(name = str_to_title(name))

# Remove whitespace
df <- df %>%
  mutate(state = str_trim(state))
df <- df %>%
  mutate(address = str_squish(address))

# Extract patterns with regex
df <- df %>%
  mutate(zip_code = str_extract(address, "\\d{5}"))

# Extract with named capture groups
address_parts <- str_match(df$address, "([A-Z]{2})\\s+(\\d{5})")
df$state_abbr <- address_parts[, 2]
df$zip <- address_parts[, 3]

# Replace values for standardization
df <- df %>%
  mutate(country = case_when(
    country %in% c("usa", "u.s.a", "u.s.") ~ "united states",
    country == "uk" ~ "united kingdom",
    TRUE ~ country
  ))

# Split strings into multiple columns
df <- df %>%
  separate(full_name, into = c("first_name", "last_name"), sep = " ", extra = "merge")

# Check for patterns
df <- df %>%
  mutate(is_corporation = str_detect(company, regex("inc|corp", ignore_case = TRUE)))
Python Output
# Example DataFrame after string cleaning
          country        name first_name last_name  is_corporation
0   united states    John Doe       John       Doe            True
1  united kingdom  Jane Smith       Jane     Smith           False
2   united states  Bob Wilson        Bob    Wilson            True
Stata Output
. list country name first_name last_name is_corporation in 1/3

     +----------------------------------------------------------------+
     |        country         name   first_name   last_name   is_corp |
     |----------------------------------------------------------------|
  1. |  united states     John Doe         John         Doe         1 |
  2. | united kingdom   Jane Smith         Jane       Smith         0 |
  3. |  united states   Bob Wilson          Bob      Wilson         1 |
     +----------------------------------------------------------------+
R Output
# A tibble: 3 × 5
  country        name       first_name last_name is_corporation
  <chr>          <chr>      <chr>      <chr>     <lgl>
1 united states  John Doe   John       Doe       TRUE
2 united kingdom Jane Smith Jane       Smith     FALSE
3 united states  Bob Wilson Bob        Wilson    TRUE

4.4 Working with Dates

Date handling varies across languages but follows similar principles: parsing string dates, extracting components, and performing date arithmetic. Properly formatted dates are essential for time series analysis, event studies, and panel data methods.

# Python: Working with dates
import pandas as pd

# Parse string dates to datetime
df['date'] = pd.to_datetime(df['date_string'])

# Specify format for non-standard dates
df['date'] = pd.to_datetime(df['date_string'], format='%d/%m/%Y')

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter
df['day_of_week'] = df['date'].dt.day_name()

# Date arithmetic
df['days_since_hire'] = (pd.Timestamp.today() - df['hire_date']).dt.days
df['deadline'] = df['start_date'] + pd.Timedelta(days=30)

# Filter by date range
df_recent = df[df['date'] >= '2020-01-01']
df_2023 = df[df['date'].between('2023-01-01', '2023-12-31')]

# Create period indicators
df['year_quarter'] = df['date'].dt.to_period('Q')

# Handle missing or invalid dates
df['date'] = pd.to_datetime(df['date_string'], errors='coerce')
print(f"Invalid dates: {df['date'].isna().sum()}")
* Stata: Working with dates

* Parse string dates to Stata date format
gen date = date(date_string, "DMY")
format date %td

* Different date formats
gen date_ymd = date(date_string, "YMD")
gen date_mdy = date(date_string, "MDY")

* Extract date components
gen year = year(date)
gen month = month(date)
gen quarter = quarter(date)
gen day_of_week = dow(date)

* Date arithmetic (Stata dates are just numbers)
gen days_since_hire = today() - hire_date
gen deadline = start_date + 30

* Create period indicators
gen year_quarter = yq(year, quarter)
format year_quarter %tq
gen year_month = ym(year, month)
format year_month %tm

* Filter by date range
keep if date >= mdy(1, 1, 2020)

* Handle datetime (with time component)
gen double datetime = clock(datetime_string, "YMDhms")
format datetime %tc
# R: Working with dates
library(tidyverse)
library(lubridate)  # Part of tidyverse, explicit load for clarity

# Parse string dates to Date objects
df <- df %>%
  mutate(date = ymd(date_string))

# Different date formats
df <- df %>%
  mutate(date_euro = dmy(date_string_euro))
df <- df %>%
  mutate(date_us = mdy(date_string_us))

# Extract date components
df <- df %>%
  mutate(
    year = year(date),
    month = month(date),
    quarter = quarter(date),
    day_of_week = wday(date, label = TRUE)
  )

# Date arithmetic
df <- df %>%
  mutate(days_since_hire = as.numeric(difftime(Sys.Date(), hire_date, units = "days")))
df <- df %>%
  mutate(deadline = start_date + days(30))

# Filter by date range
df_recent <- df %>%
  filter(date >= ymd("2020-01-01"))
df_2023 <- df %>%
  filter(between(date, ymd("2023-01-01"), ymd("2023-12-31")))

# Create period indicators
df <- df %>%
  mutate(quarter_start = floor_date(date, "quarter"))

# Handle parsing failures
df <- df %>%
  mutate(date = ymd(date_string, quiet = TRUE))
cat("Invalid dates:", sum(is.na(df$date)), "\n")
Python Output
# Date columns after processing
        date  year  month  quarter day_of_week  days_since_hire    deadline
0 2024-01-15  2024      1        1      Monday              365  2024-02-14
1 2024-03-22  2024      3        1      Friday              299  2024-04-21
2 2023-12-01  2023     12        4      Friday              421  2023-12-31
Invalid dates: 3
Stata Output
. list date year month quarter day_of_week in 1/3

     +--------------------------------------------+
     |      date   year   month   quarter   dow   |
     |--------------------------------------------|
  1. | 15jan2024   2024       1         1     1   |
  2. | 22mar2024   2024       3         1     5   |
  3. | 01dec2023   2023      12         4     5   |
     +--------------------------------------------+
R Output
# A tibble: 3 × 6
  date        year month quarter day_of_week days_since_hire
  <date>     <dbl> <dbl>   <int> <ord>                 <dbl>
1 2024-01-15  2024     1       1 Mon                     365
2 2024-03-22  2024     3       1 Fri                     299
3 2023-12-01  2023    12       4 Fri                     421
Invalid dates: 3
Common Date Format Codes

Python/R: %Y = 4-digit year, %m = month (01-12), %d = day (01-31), %H = hour, %M = minute, %S = second
Stata: Y = year, M = month, D = day, h = hour, m = minute, s = second
Example: "2024-01-15" = %Y-%m-%d (Python/R) or "YMD" (Stata)

4.5 Data Validation

Before analysis, always validate your data for consistency and logical errors. Validation catches problems that summary statistics might miss: impossible values, logical inconsistencies, and data entry errors. Build validation checks into your workflow and run them after every data transformation.

# Python: Data validation
import pandas as pd

# Check value ranges (impossible values)
assert (df['age'] >= 0).all() & (df['age'] <= 120).all(), "Invalid ages found"
assert (df['income'] >= 0).all(), "Negative income values found"

# Find and report violations (without stopping)
invalid_age = (df['age'] < 0) | (df['age'] > 120)
print(f"Rows with invalid age: {invalid_age.sum()}")
print(df[invalid_age])  # Show the problematic rows

# Check categorical values (unexpected categories)
valid_genders = ['M', 'F', 'Other']
invalid_gender = ~df['gender'].isin(valid_genders)
print(f"Invalid gender values: {df.loc[invalid_gender, 'gender'].unique()}")

# Check logical consistency (cross-variable rules)
inconsistent = df['retirement_year'] < df['birth_year']
print(f"Retired before birth: {inconsistent.sum()} rows")

# Check for duplicates
duplicates = df.duplicated(subset=['person_id'], keep=False)
print(f"Duplicate IDs: {duplicates.sum()}")

# Validation function for reuse
def validate_data(df):
    issues = []
    if (df['age'] < 0).any():
        issues.append(f"Negative ages: {(df['age'] < 0).sum()}")
    if (df['income'] < 0).any():
        issues.append(f"Negative incomes: {(df['income'] < 0).sum()}")
    if df.duplicated(subset=['person_id']).any():
        issues.append(f"Duplicate IDs found")
    if issues:
        print("Validation failed:\n" + "\n".join(issues))
        return False
    print("All validation checks passed")
    return True

validate_data(df)
* Stata: Data validation

* Check value ranges (impossible values)
assert age >= 0 & age <= 120
assert income >= 0

* Find and report violations (without stopping)
count if age < 0 | age > 120
list if age < 0 | age > 120

* Check categorical values
count if !inlist(gender, "M", "F", "Other")
tab gender if !inlist(gender, "M", "F", "Other")  // Show invalid values

* Check logical consistency
count if retirement_year < birth_year
list person_id birth_year retirement_year if retirement_year < birth_year

* Check for duplicates
duplicates report person_id
duplicates list person_id

* Create validation summary
capture program drop validate_data
program define validate_data
    display "Running validation checks..."
    quietly count if age < 0 | age > 120
    if r(N) > 0 display "WARNING: " r(N) " invalid age values"
    quietly count if income < 0
    if r(N) > 0 display "WARNING: " r(N) " negative incomes"
    quietly duplicates report person_id
    if r(unique_value) < r(N) display "WARNING: duplicate person_id values"
    display "Validation complete"
end

validate_data
# R: Data validation
library(tidyverse)

# Check value ranges (impossible values)
stopifnot(all(df$age >= 0 & df$age <= 120, na.rm = TRUE))
stopifnot(all(df$income >= 0, na.rm = TRUE))

# Find and report violations (without stopping)
invalid_age <- df %>%
  filter(age < 0 | age > 120)
cat("Rows with invalid age:", nrow(invalid_age), "\n")
print(invalid_age)

# Check categorical values
valid_genders <- c("M", "F", "Other")
invalid_gender <- df %>%
  filter(!gender %in% valid_genders)
cat("Invalid gender values:", unique(invalid_gender$gender), "\n")

# Check logical consistency
inconsistent <- df %>%
  filter(retirement_year < birth_year)
cat("Retired before birth:", nrow(inconsistent), "rows\n")

# Check for duplicates
duplicates <- df %>%
  group_by(person_id) %>%
  filter(n() > 1) %>%
  ungroup()
cat("Duplicate IDs:", nrow(duplicates), "\n")

# Validation function for reuse
validate_data <- function(df) {
  issues <- c()
  if (any(df$age < 0, na.rm = TRUE)) {
    issues <- c(issues, paste("Negative ages:", sum(df$age < 0, na.rm = TRUE)))
  }
  if (any(df$income < 0, na.rm = TRUE)) {
    issues <- c(issues, paste("Negative incomes:", sum(df$income < 0, na.rm = TRUE)))
  }
  if (any(duplicated(df$person_id))) {
    issues <- c(issues, "Duplicate IDs found")
  }
  if (length(issues) > 0) {
    cat("Validation failed:\n", paste(issues, collapse = "\n"), "\n")
    return(FALSE)
  }
  cat("All validation checks passed\n")
  return(TRUE)
}

validate_data(df)
Python Output
Invalid ages (outside 0-120): 2
    person_id  age
47      A0048 -999
203     B0204  150
Negative incomes: 0
Invalid gender values: 3
  gender  count
  m           1
  F.          1
  male        1
Retired before birth: 1 rows
Duplicate person_ids: 4
Validation failed:
Invalid ages found
Duplicate IDs found
Stata Output
. count if age < 0 | age > 120
  2

. count if !inlist(gender, "M", "F", "Other")
  3

. duplicates report person_id

Duplicates in terms of person_id

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |          996             0
        2 |            4             2
--------------------------------------

. validate_data
Running validation checks...
WARNING: 2 invalid age values
WARNING: duplicate person_id values
Validation complete
R Output
Invalid ages (outside 0-120): 2
# A tibble: 2 × 2
  person_id   age
  <chr>     <dbl>
1 A0048      -999
2 B0204       150
Invalid gender values: 3
Retired before birth: 1 rows
Duplicate IDs: 4
Validation failed:
 Invalid ages found
 Duplicate IDs found
Validation Best Practices

1. Validate early and often — run checks after loading data and after each transformation.
2. Document expected ranges — write down what values are valid for each variable.
3. Don't silently fix — log all corrections for reproducibility (a minimal logging pattern is sketched after this list).
4. Check relationships — some errors only appear when comparing multiple variables.
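
One way to follow point 3 is to keep a small correction log while cleaning and save it next to the cleaned data. Below is a minimal Python sketch of that pattern, assuming the same df as earlier; the log_fix helper and the output file name are hypothetical, not part of any library.

# Python: a minimal correction log (hypothetical pattern, not a library API)
import numpy as np
import pandas as pd

correction_log = []   # one entry per fix applied during cleaning

def log_fix(variable, rule, n_affected):
    """Record what was changed, why, and how many rows it touched."""
    correction_log.append({'variable': variable, 'rule': rule, 'n_affected': n_affected})

# Example: recode impossible ages to missing and log the correction
bad_age = (df['age'] < 0) | (df['age'] > 120)
df.loc[bad_age, 'age'] = np.nan
log_fix('age', 'outside 0-120 set to missing', int(bad_age.sum()))

# Save alongside the cleaned data so every correction is documented
pd.DataFrame(correction_log).to_csv('correction_log.csv', index=False)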

4.6 Case Study: CPS Data Cleaning

The Current Population Survey (CPS) is a key data source for labor economics research. Let's walk through cleaning a typical CPS extract, applying all the techniques from this module.

About the CPS

The CPS is a monthly survey of about 60,000 households conducted by the Bureau of Labor Statistics. It provides employment, earnings, and demographic data. Access cleaned extracts from IPUMS CPS.

Data source: Sarah Flood et al. IPUMS CPS: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2023.

The code below demonstrates a realistic cleaning workflow: handling special missing value codes, constructing derived variables like hourly wages, and validating the cleaned data.

# Python: CPS Data Cleaning Pipeline
import pandas as pd
import numpy as np

# Load CPS data (assume downloaded from IPUMS)
cps = pd.read_csv('cps_extract.csv')

# Step 1: Handle CPS missing value codes
# IPUMS uses codes like 999999 or 9999999 for NIU/missing
missing_codes = [9999999, 9999998, 999999, 999998]
cps['incwage'] = cps['incwage'].replace(missing_codes, np.nan)
cps['uhrswork'] = cps['uhrswork'].replace([999, 997], np.nan)

# Step 2: Restrict sample (prime-age workers)
cps = cps[
    (cps['age'] >= 25) &
    (cps['age'] <= 54) &
    (cps['incwage'] > 0) &
    (cps['uhrswork'] > 0)
].copy()

# Step 3: Construct hourly wage
cps['hourly_wage'] = cps['incwage'] / (cps['uhrswork'] * 52)

# Step 4: Handle wage outliers
cps = cps[(cps['hourly_wage'] >= 3) & (cps['hourly_wage'] <= 500)]

# Step 5: Create demographic indicators
cps['female'] = (cps['sex'] == 2).astype(int)
cps['educ_cat'] = pd.cut(
    cps['educ'],
    bins=[0, 72, 73, 110, 999],
    labels=['Less than HS', 'High School', 'Some College', 'College+']
)

# Step 6: Create log wage (common in labor economics)
cps['ln_wage'] = np.log(cps['hourly_wage'])

# Step 7: Final validation
print(f"Final sample size: {len(cps):,}")
print(f"Missing wages: {cps['hourly_wage'].isna().sum()}")
print(cps[['hourly_wage', 'ln_wage', 'age']].describe())
* Stata: CPS Data Cleaning Pipeline

* Load CPS data (assume downloaded from IPUMS)
import delimited "cps_extract.csv", clear

* Step 1: Handle CPS missing value codes
* IPUMS uses codes like 999999 or 9999999 for NIU/missing
replace incwage = . if inlist(incwage, 9999999, 9999998, 999999, 999998)
replace uhrswork = . if inlist(uhrswork, 999, 997)

* Step 2: Restrict sample (prime-age workers)
keep if age >= 25 & age <= 54
keep if incwage > 0 & !missing(incwage)
keep if uhrswork > 0 & !missing(uhrswork)

* Step 3: Construct hourly wage
gen hourly_wage = incwage / (uhrswork * 52)
label variable hourly_wage "Hourly wage (annual earnings / annual hours)"

* Step 4: Handle wage outliers
drop if hourly_wage < 3 | hourly_wage > 500

* Step 5: Create demographic indicators
gen female = (sex == 2)
label variable female "Female indicator"
label define female_lbl 0 "Male" 1 "Female"
label values female female_lbl

gen educ_cat = .
replace educ_cat = 1 if educ < 73
replace educ_cat = 2 if educ == 73
replace educ_cat = 3 if educ >= 80 & educ < 111
replace educ_cat = 4 if educ >= 111
label define educ_lbl 1 "Less than HS" 2 "High School" 3 "Some College" 4 "College+"
label values educ_cat educ_lbl

* Step 6: Create log wage (common in labor economics)
gen ln_wage = ln(hourly_wage)
label variable ln_wage "Log hourly wage"

* Step 7: Final validation
count
summarize hourly_wage ln_wage age
misstable summarize hourly_wage ln_wage
# R: CPS Data Cleaning Pipeline
library(tidyverse)

# Load CPS data (assume downloaded from IPUMS)
cps <- read_csv("cps_extract.csv")

# Step 1: Handle CPS missing value codes
# IPUMS uses codes like 999999 or 9999999 for NIU/missing
cps <- cps %>%
  mutate(
    incwage = if_else(incwage %in% c(9999999, 9999998, 999999, 999998), NA_real_, incwage),
    uhrswork = if_else(uhrswork %in% c(999, 997), NA_real_, uhrswork)
  )

# Step 2: Restrict sample (prime-age workers)
cps <- cps %>%
  filter(
    age >= 25 & age <= 54,
    incwage > 0,
    uhrswork > 0
  )

# Step 3: Construct hourly wage
cps <- cps %>%
  mutate(hourly_wage = incwage / (uhrswork * 52))

# Step 4: Handle wage outliers
cps <- cps %>%
  filter(hourly_wage >= 3 & hourly_wage <= 500)

# Step 5: Create demographic indicators
cps <- cps %>%
  mutate(female = as.integer(sex == 2))

cps <- cps %>%
  mutate(
    educ_cat = case_when(
      educ < 73 ~ "Less than HS",
      educ == 73 ~ "High School",
      educ >= 80 & educ < 111 ~ "Some College",
      educ >= 111 ~ "College+"
    ) %>% factor(levels = c("Less than HS", "High School", "Some College", "College+"))
  )

# Step 6: Create log wage (common in labor economics)
cps <- cps %>%
  mutate(ln_wage = log(hourly_wage))

# Step 7: Final validation
cat("Final sample size:", nrow(cps), "\n")
cat("Missing wages:", sum(is.na(cps$hourly_wage)), "\n")
cps %>%
  select(hourly_wage, ln_wage, age) %>%
  summary()
Python Output
Final sample size: 45,823
Missing wages: 0
       hourly_wage    ln_wage        age
count     45823.00   45823.00   45823.00
mean         28.45       3.14      39.72
std          22.67       0.68       8.41
min           3.00       1.10      25.00
25%          14.42       2.67      32.00
50%          22.12       3.10      40.00
75%          35.58       3.57      48.00
max         480.77       6.18      54.00
Stata Output
. count
  45,823

. summarize hourly_wage ln_wage age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
 hourly_wage |     45,823    28.45312    22.66934          3     480.77
     ln_wage |     45,823    3.140215    .6782031   1.098612   6.175867
         age |     45,823    39.72148    8.413561         25         54

. misstable summarize hourly_wage ln_wage
                                                              Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<. | values        Min         Max
  -------------+--------------------------------+------------------------------
               |  (no missing values found)
R Output
Final sample size: 45823
Missing wages: 0
  hourly_wage        ln_wage           age
 Min.   :  3.00   Min.   :1.099   Min.   :25.00
 1st Qu.: 14.42   1st Qu.:2.669   1st Qu.:32.00
 Median : 22.12   Median :3.096   Median :40.00
 Mean   : 28.45   Mean   :3.140   Mean   :39.72
 3rd Qu.: 35.58   3rd Qu.:3.572   3rd Qu.:48.00
 Max.   :480.77   Max.   :6.176   Max.   :54.00

4.7 Exercises

Exercise 4.1: Handling Missing Data

Practice identifying and handling missing values. Complete these tasks:

  1. Check for missing values in a dataset
  2. Drop or fill missing values
  3. Create a missing indicator variable

Exercise 4.2: Detecting Outliers

Practice identifying outliers using the z-score method. Complete these tasks:

  1. Calculate z-scores for a numeric variable
  2. Flag observations where |z| > 3 as outliers
  3. Remove or winsorize the outliers

Exercise 4.3: Cleaning String Data

Practice standardizing messy text variables. Complete these tasks:

  1. Standardize case (uppercase or lowercase)
  2. Remove extra whitespace
  3. Extract a pattern using regex (e.g., zip code)

Exercise 4.4: Working with Dates

Practice parsing and manipulating date variables. Complete these tasks:

  1. Parse a date string into a date object
  2. Extract year, month, or quarter from a date
  3. Calculate time difference between dates

Exercise 4.5: Data Validation

Practice validating data for consistency and errors. Complete these tasks:

  1. Check for valid value ranges (e.g., age between 0-120)
  2. Identify duplicate records
  3. Check logical consistency between variables