10 History of NLP
The story of Natural Language Processing is a fascinating journey from ambitious dreams to unexpected breakthroughs. Understanding this history isn't just academic curiosity—it reveals why modern language models work the way they do, and helps you use them more effectively.
Learning Objectives
- Trace the evolution of NLP from hand-crafted rules to self-supervised learning
- Understand the key algorithmic breakthroughs and why they mattered
- Grasp the technical intuition behind word embeddings, attention, and transformers
- Connect historical developments to the capabilities and limitations of modern LLMs
- Appreciate the role of scale, data, and compute in the AI revolution
Table of Contents
- 10.0 The Big Picture: A 70-Year Journey
- 10.1 The Symbolic Era (1950s-1980s)
- 10.2 The Statistical Revolution (1990s-2000s)
- 10.3 Word Embeddings: Meaning as Geometry (2013)
- 10.4 Sequence Models: Learning to Remember
- 10.5 Attention: The Breakthrough (2014-2017)
- 10.6 Transformers: The Architecture That Changed Everything
- 10.7 The Modern Era: Scale, RLHF, and Beyond
- 10.8 Why This Matters for Economists
- References & Further Reading
10.0 The Big Picture: A 70-Year Journey
Before diving into details, let's see the forest before the trees. NLP has gone through four major paradigms, each representing a fundamentally different philosophy about how machines should process language:
Era 1: Symbolic (1950s-1980s)
- Philosophy: Language follows rules; encode them
- Method: Hand-written grammars, pattern matching
- Bottleneck: Rules can't capture all of language
- Legacy: ELIZA, expert systems
Era 2: Statistical (1990s-2000s)
- Philosophy: Language is probabilistic; learn from data
- Method: N-grams, HMMs, Naive Bayes, SVMs
- Bottleneck: Feature engineering is manual
- Legacy: Spam filters, early MT
Era 3: Neural (2013-2017)
- Philosophy: Learn representations automatically
- Method: Word2Vec, RNNs, LSTMs, Seq2Seq
- Bottleneck: Sequential processing is slow
- Legacy: Word embeddings, neural MT
Era 4: Transformer (2017-present)
- Philosophy: Attention + scale = emergence
- Method: Self-attention, pre-training, RLHF
- Bottleneck: Compute, alignment, hallucination
- Legacy: BERT, GPT, Claude, and the AI revolution
Each paradigm shift wasn't just incremental improvement—it was a fundamental rethinking of the problem. The move from rules to statistics said "let data speak." The move from statistics to neural networks said "let the model find its own features." The move to transformers said "let every word attend to every other word, in parallel." Understanding these shifts helps you understand what modern models can and cannot do.
10.1 The Symbolic Era (1950s-1980s)
The Dream of Machine Intelligence
The story begins with Alan Turing's 1950 paper "Computing Machinery and Intelligence", which posed the question that still drives AI research: Can machines think?
Turing proposed what we now call the Turing Test: a machine passes if a human interrogator, communicating only through text, cannot reliably distinguish it from a human. This framing was remarkably prescient—it anticipated that language would be central to machine intelligence.
"I propose to consider the question, 'Can machines think?' This should begin with definitions of the meaning of the terms 'machine' and 'think.'"
— Alan Turing, Computing Machinery and Intelligence (1950)
ELIZA: The First Chatbot (1966)
Joseph Weizenbaum at MIT created ELIZA, one of the first programs to engage in human-like conversation. ELIZA simulated a Rogerian psychotherapist using simple pattern matching and substitution rules.
# A typical ELIZA conversation
User: I am feeling sad today.
ELIZA: I am sorry to hear you are feeling sad today.
User: My mother doesn't understand me.
ELIZA: Tell me more about your family.
User: I think you don't really understand me either.
ELIZA: What makes you think I don't really understand you?
ELIZA worked by matching keywords ("mother" → family topic) and transforming sentences with templates. It had no understanding whatsoever—just clever rules. Yet users found it surprisingly engaging, a phenomenon Weizenbaum called the "ELIZA effect."
People have a strong tendency to attribute understanding and emotion to systems that use language, even very simple ones. This psychological phenomenon is crucial for understanding public reactions to modern chatbots—and for being appropriately skeptical about what "understanding" really means.
How ELIZA Actually Worked
Let's peek under the hood. ELIZA's "intelligence" was a set of pattern-matching rules:
# Simplified ELIZA-style rules in Python
import re
rules = [
# (pattern, response_template)
(r"I am (.*)", "Why do you say you are {0}?"),
(r"I feel (.*)", "Tell me more about feeling {0}."),
(r"(.*) mother (.*)", "Tell me more about your family."),
(r"(.*) father (.*)", "How do you feel about your father?"),
(r"I think (.*)", "What makes you think {0}?"),
(r"(.*)", "Please go on."), # fallback
]
def eliza_respond(user_input):
for pattern, response in rules:
match = re.match(pattern, user_input, re.IGNORECASE)
if match:
return response.format(*match.groups())
return "I see. Please continue."
The Chomsky Paradigm
Meanwhile, Noam Chomsky's work dominated academic linguistics. His theory of transformational grammar proposed that language was governed by innate, universal rules—and that statistical approaches were fundamentally misguided.
"It must be recognized that the notion 'probability of a sentence' is an entirely useless one."
— Noam Chomsky, Syntactic Structures (1957)
Chomsky illustrated this with the famous example "Colorless green ideas sleep furiously": the sentence is grammatically well-formed but meaningless, and had never appeared in any corpus, so a purely statistical model would assign it essentially zero probability. Yet we immediately recognize it as grammatical. He argued this showed that language could not be reduced to statistics.
Chomsky was half-right. Human language does have deep structure that pure statistics might miss. But he underestimated what statistical methods could achieve with enough data and the right architectures. The irony is that modern neural networks have learned grammatical structure purely from statistics—they can distinguish grammatical from ungrammatical sentences, even weird ones, without being told any rules.
The ALPAC Report and the First AI Winter (1966)
The Automatic Language Processing Advisory Committee (ALPAC) issued a devastating report concluding that machine translation was too expensive and of too poor quality to be practical. Government funding for NLP research collapsed, leading to what's called the first "AI Winter."
The report was correct about the limitations of rule-based approaches. It was wrong about the ultimate possibility of machine translation—it just required a completely different approach.
10.2 The Statistical Revolution (1990s-2000s)
The statistical revolution came when researchers stopped trying to encode linguistic rules and started treating language as data to be modeled probabilistically. This paradigm shift was driven by researchers at IBM working on speech recognition.
"Every time I fire a linguist, the performance of the speech recognizer goes up."
— Fred Jelinek, IBM (apocryphal, but captures the spirit)
The Core Idea: Language Modeling
A language model assigns probabilities to sequences of words. The fundamental question: given some words, what word is likely to come next?
This might seem like a detour from "understanding" language, but it turns out that predicting the next word requires capturing a tremendous amount of linguistic knowledge.
N-gram Models: The Markov Assumption
Estimating the full conditional probability P(word | all previous words) is hopeless in practice: almost every long history of words is unique, so there is no data to estimate from. The Markov assumption simplifies this: condition only on the last n-1 words.
# N-gram probability estimation
# Unigram (n=1): Each word is independent
P("cat") = count("cat") / total_words
# Bigram (n=2): Depends on previous word only
P("cat" | "the") = count("the cat") / count("the")
# Trigram (n=3): Depends on two previous words
P("mat" | "on the") = count("on the mat") / count("on the")
# Example with a toy corpus: "the cat sat on the mat"
# Bigram counts: "the cat":1, "cat sat":1, "sat on":1, "on the":1, "the mat":1
P("cat" | "the") = 1/2 = 0.5 # "the" appears twice, followed by "cat" once
P("mat" | "the") = 1/2 = 0.5 # "the" appears twice, followed by "mat" once
P("dog" | "the") = 0/2 = 0.0 # Never seen "the dog" - assigns zero!
The Sparsity Problem
N-gram models suffer from sparsity: most possible n-grams never appear in training data. If you've never seen "the dog" in your corpus, is it impossible? Of course not!
Various "smoothing" methods were developed to handle unseen n-grams: add-one smoothing (Laplace), Good-Turing estimation, Kneser-Ney smoothing. These assign small but non-zero probabilities to unseen events. Kneser-Ney, in particular, remained state-of-the-art for language modeling well into the neural era.
Bag of Words and TF-IDF
For document classification tasks (spam detection, sentiment analysis, topic categorization), simpler representations often worked well:
Bag of Words
- Represent document as word counts
- Ignores word order entirely
- "The cat sat on the mat" = {the:2, cat:1, sat:1, on:1, mat:1}
- Simple but surprisingly effective
TF-IDF
- Term Frequency: How often word appears in document
- Inverse Document Frequency: Penalize common words
- TF-IDF = TF × log(N/df)
- Still used in search engines today
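A small sketch of both representations using scikit-learn, assuming it is installed; the two toy documents are made up, and note that scikit-learn's TF-IDF uses a slightly smoothed variant of the formula above:
# Bag of words vs. TF-IDF on two toy documents (requires scikit-learn)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ["the cat sat on the mat", "the dog chased the cat"]
bow = CountVectorizer()
counts = bow.fit_transform(docs)       # sparse matrix of raw word counts
print(bow.get_feature_names_out())     # the vocabulary (alphabetical)
print(counts.toarray())                # word order is ignored entirely
tfidf = TfidfVectorizer()              # smoothed IDF variant of TF x log(N/df)
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))      # frequent words like "the" are downweighted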
Hidden Markov Models (HMMs)
For sequence labeling tasks like part-of-speech tagging (assigning noun, verb, adjective to each word), Hidden Markov Models became the dominant approach. HMMs model sequences where the underlying "state" (the part of speech) is hidden, and we only observe the words.
# Given a sentence, find the most likely sequence of tags
sentence = ["The", "cat", "sat", "quickly"]
# HMM estimates two types of probabilities:
# 1. Transition probabilities: P(tag_i | tag_{i-1})
P(NOUN | DET) = 0.6 # After "the", nouns are common
P(VERB | NOUN) = 0.4 # After a noun, verbs are common
# 2. Emission probabilities: P(word | tag)
P("cat" | NOUN) = 0.01 # "cat" is a noun
P("sat" | VERB) = 0.005 # "sat" is a verb
# Viterbi algorithm finds: DET → NOUN → VERB → ADV
tags = ["DET", "NOUN", "VERB", "ADV"]
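For readers who want to see the decoding step itself, here is a minimal Viterbi implementation for that sentence; every probability below is made up for illustration rather than estimated from a real tagged corpus:
# Minimal Viterbi decoding with made-up HMM probabilities (illustrative only)
tags = ["DET", "NOUN", "VERB", "ADV"]
start = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.05, "ADV": 0.05}
trans = {  # P(tag_i | tag_{i-1})
    "DET":  {"DET": 0.01, "NOUN": 0.9,  "VERB": 0.05, "ADV": 0.04},
    "NOUN": {"DET": 0.05, "NOUN": 0.1,  "VERB": 0.7,  "ADV": 0.15},
    "VERB": {"DET": 0.3,  "NOUN": 0.2,  "VERB": 0.05, "ADV": 0.45},
    "ADV":  {"DET": 0.2,  "NOUN": 0.2,  "VERB": 0.4,  "ADV": 0.2},
}
emit = {  # P(word | tag); words not listed get a tiny floor probability
    "DET": {"the": 0.7}, "NOUN": {"cat": 0.05},
    "VERB": {"sat": 0.02}, "ADV": {"quickly": 0.03},
}
def viterbi(words):
    # best[t][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (start[t] * emit[t].get(words[0], 1e-6), None) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            p, prev = max(
                (best[-1][pt][0] * trans[pt][t] * emit[t].get(w, 1e-6), pt)
                for pt in tags
            )
            col[t] = (p, prev)
        best.append(col)
    path = [max(tags, key=lambda t: best[-1][t][0])]   # most probable final tag
    for col in reversed(best[1:]):                     # follow the back-pointers
        path.append(col[path[-1]][1])
    return list(reversed(path))
print(viterbi([w.lower() for w in ["The", "cat", "sat", "quickly"]]))
# ['DET', 'NOUN', 'VERB', 'ADV']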
The IBM Translation Models (1990s)
IBM researchers pioneered statistical machine translation by learning to align words between languages from parallel corpora (texts in two languages that are translations of each other).
The key insight: translation could be decomposed into a language model (how likely is this English sentence?) and a translation model (how likely is this French sentence as a translation of that English?). Using Bayes' theorem to translate French F into English E:
best_E = argmax over E of P(E | F) = argmax over E of P(F | E) * P(E)
This approach dominated machine translation for nearly two decades, culminating in systems like Google Translate (pre-neural version).
10.3 Word Embeddings: Meaning as Geometry (2013)
The breakthrough that bridged statistical and neural NLP came from a deceptively simple idea: represent words as dense vectors in a continuous space, where similar words are close together.
The Distributional Hypothesis
The theoretical foundation comes from linguistics:
"You shall know a word by the company it keeps."
— J.R. Firth (1957)
Words that appear in similar contexts have similar meanings. "Dog" and "cat" appear near words like "pet," "fur," "veterinarian." "King" and "queen" appear near "royal," "throne," "crown."
Word2Vec: The Revolution (2013)
Tomas Mikolov and colleagues at Google introduced Word2Vec, which trained a shallow neural network to predict words from their surrounding context (or the context from a word). The learned weights of that network became the word embeddings.
Skip-gram
- Given center word, predict context
- Input: "cat"
- Predict: "the", "sat", "on", "mat"
- Better for rare words
CBOW (Continuous Bag of Words)
- Given context, predict center word
- Input: "the", "sat", "on", "mat"
- Predict: "cat"
- Faster to train
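If you want to train embeddings yourself, the gensim library implements both variants; the sketch below assumes gensim is installed and uses a throwaway toy corpus, so the resulting vectors are not meaningful (real embeddings need very large corpora):
# Training skip-gram embeddings with gensim (toy corpus, illustrative only)
from gensim.models import Word2Vec
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]
model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embeddings
    window=2,         # context window size
    min_count=1,      # keep every word (only sensible for a toy corpus)
    sg=1,             # 1 = skip-gram, 0 = CBOW
)
print(model.wv["cat"][:5])           # first few dimensions of the "cat" vector
print(model.wv.most_similar("cat"))  # nearest neighbors (noisy on a toy corpus)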
The Magic of Vector Arithmetic
Word2Vec's most famous result was that semantic relationships became vector operations:
# These relationships emerged automatically from text!
# Gender relationship
vector("king") - vector("man") + vector("woman") โ vector("queen")
# Capital cities
vector("paris") - vector("france") + vector("italy") โ vector("rome")
# Verb tense
vector("walking") - vector("walk") + vector("swim") โ vector("swimming")
# Superlatives
vector("biggest") - vector("big") + vector("small") โ vector("smallest")
The model learned these relationships purely from word co-occurrence patterns—no explicit knowledge was provided!
Why does this work? The direction from "man" to "woman" captures the concept of "gender change." The direction from "France" to "Paris" captures "capital of." When you add and subtract these directions, you're combining concepts algebraically. It's not perfect, but it's remarkable that it works at all—and that it emerged automatically from prediction.
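Mechanically, an analogy query is just vector arithmetic followed by a nearest-neighbor search under cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors, not real embeddings, purely to show the computation:
# "king - man + woman ~ queen" as nearest-neighbor search (hand-made toy vectors)
import numpy as np
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.8, 0.0]),
    "woman": np.array([0.1, 0.2, 0.0]),
    "apple": np.array([0.0, 0.1, 0.9]),
}
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
target = emb["king"] - emb["man"] + emb["woman"]
# Rank every word except the query words by similarity to the target vector
scores = {w: cosine(target, v) for w, v in emb.items() if w not in {"king", "man", "woman"}}
print(max(scores, key=scores.get))  # "queen"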
GloVe: Combining Count-Based and Neural Methods (2014)
Stanford's GloVe (Global Vectors) combined the insights of count-based methods (which work from global co-occurrence counts) with neural embedding learning. Instead of predicting context words one at a time, GloVe fit word vectors directly to co-occurrence statistics aggregated over the entire corpus.
The Limitation: One Vector Per Word
Word2Vec and GloVe assign exactly one vector to each word. But words have multiple meanings:
- "bank": financial institution vs. river bank
- "cell": biological cell vs. prison cell vs. phone
- "right": correct vs. opposite of left vs. political leaning
These static embeddings average all meanings together. Solving this required models that could understand context—which led to the next breakthrough.
10.4 Sequence Models: Learning to Remember
To capture context, we need models that process sequences of words and maintain a "memory" of what they've seen. Enter Recurrent Neural Networks (RNNs).
The RNN Architecture
An RNN processes one word at a time, maintaining a hidden state that gets updated at each step:
# Processing "The cat sat on the mat"
h_0 = [0, 0, 0, ...] # Initial hidden state (zeros)
# Step 1: Process "The"
h_1 = tanh(W_h @ h_0 + W_x @ embed("The") + b)
# Step 2: Process "cat" - h_1 carries info about "The"
h_2 = tanh(W_h @ h_1 + W_x @ embed("cat") + b)
# Step 3: Process "sat" - h_2 carries info about "The cat"
h_3 = tanh(W_h @ h_2 + W_x @ embed("sat") + b)
# ... and so on
# The hidden state h_t encodes context from all previous words
# But information from early words gets "diluted" over time
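The same update rule as runnable numpy, with random weights and random stand-in embeddings; this demonstrates the mechanics of the recurrence, not learned behavior:
# One-layer RNN forward pass in numpy (random weights, purely mechanical demo)
import numpy as np
rng = np.random.default_rng(0)
d_hidden, d_embed = 8, 4
W_h = rng.normal(size=(d_hidden, d_hidden)) * 0.1
W_x = rng.normal(size=(d_hidden, d_embed)) * 0.1
b = np.zeros(d_hidden)
words = ["The", "cat", "sat", "on", "the", "mat"]
embed = {w: rng.normal(size=d_embed) for w in words}   # stand-in embeddings
h = np.zeros(d_hidden)                                 # h_0
for w in words:
    h = np.tanh(W_h @ h + W_x @ embed[w] + b)          # h_t depends on h_{t-1} and the word
print(h.round(3))   # the final hidden state summarizes the whole (toy) sequence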
The Vanishing Gradient Problem
Standard RNNs have a critical flaw: during training, gradients either vanish (shrink to nearly zero) or explode (grow uncontrollably) when backpropagating through many time steps. This means:
- The network struggles to learn long-range dependencies
- Information from early words gets lost
- In "The cat that the dog chased ran away," connecting "cat" to "ran" is hard
LSTMs: Learning What to Remember (1997)
Long Short-Term Memory networks, invented by Sepp Hochreiter and Jürgen Schmidhuber, solved this with a clever architecture using gates that control information flow:
Forget Gate
Decides what information from the previous cell state to discard. "Should I still remember that the subject was 'cat'?"
Input Gate
Decides what new information to store. "Is this word important enough to remember?"
Output Gate
Decides what to output based on the cell state. "What's relevant for the current prediction?"
The gates allow the network to maintain information over long distances. The key is the cell state—a "highway" that information can flow along unchanged. The forget gate can be set to nearly 1.0, meaning "keep everything," allowing gradients to flow unimpeded. This architectural innovation was crucial, and the idea of an unobstructed highway for information and gradients reappears in the residual connections of transformers.
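To see how the three gates fit together, here is a single LSTM step in numpy with random weights; it follows the standard gate equations but is a toy, not a trained model:
# One LSTM step in numpy: forget, input, and output gates (random weights, toy sizes)
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
rng = np.random.default_rng(0)
d_h, d_x = 8, 4
# One weight matrix and bias per gate, plus one for the candidate cell content
W = {g: rng.normal(size=(d_h, d_h + d_x)) * 0.1 for g in ["f", "i", "o", "c"]}
b = {g: np.zeros(d_h) for g in ["f", "i", "o", "c"]}
def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])           # previous hidden state + new input
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate: what to keep from c_prev
    i = sigmoid(W["i"] @ z + b["i"])          # input gate: how much new info to write
    o = sigmoid(W["o"] @ z + b["o"])          # output gate: what to expose as h
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate new content
    c = f * c_prev + i * c_tilde              # the cell-state "highway"
    h = o * np.tanh(c)
    return h, c
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(6, d_x)):           # six toy "word vectors"
    h, c = lstm_step(x, h, c)
print(h.round(3))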
Sequence-to-Sequence Models (2014)
For tasks like translation, where input and output lengths differ, Sutskever et al. introduced the encoder-decoder architecture:
# Translating "The cat sat" → "Le chat s'assit"
# ENCODER: Process input and compress to a single vector
h_1 = LSTM("The", h_0)
h_2 = LSTM("cat", h_1)
h_3 = LSTM("sat", h_2)
context = h_3 # This single vector must encode the entire input!
# DECODER: Generate output one word at a time
s_1 = LSTM("<START>", context) → "Le"
s_2 = LSTM("Le", s_1) → "chat"
s_3 = LSTM("chat", s_2) → "s'assit"
s_4 = LSTM("s'assit", s_3) → "<END>"
This was the state of the art for machine translation in 2014-2016. But there was a fundamental problem: the entire input had to be compressed into a single fixed-size vector. For long sentences, this bottleneck caused severe information loss.
10.5 Attention: The Breakthrough (2014-2017)
The attention mechanism, introduced by Bahdanau et al. (2015), solved the bottleneck problem with an elegant idea: instead of forcing everything through one vector, let the decoder look back at all encoder states and decide which ones are relevant at each step.
How Attention Works
At each decoding step, attention computes a weighted sum of all encoder hidden states. The weights reflect "how relevant is each input word for generating this output word?"
# Generating the French word "chat" (cat)
# Encoder states from "The cat sat"
encoder_states = [h_the, h_cat, h_sat]
# Current decoder state
decoder_state = s_1
# Compute attention scores (how relevant is each encoder state?)
scores = [dot(s_1, h_the), # 0.1 - "The" not very relevant
dot(s_1, h_cat), # 0.8 - "cat" highly relevant!
dot(s_1, h_sat)] # 0.1 - "sat" not very relevant
# Convert to probabilities with softmax
attention_weights = softmax(scores) # [0.1, 0.8, 0.1]
# Weighted sum of encoder states
context = 0.1*h_the + 0.8*h_cat + 0.1*h_sat
# Use context to help predict "chat"
output = predict(decoder_state, context) # โ "chat"
Attention learns which input words "align" with which output words. When translating "chat," the model focuses on "cat." When generating a verb, it attends to the input verb. This alignment emerges automatically from training—no one told the model which words correspond to which.
The Transformer Insight (2017)
The landmark paper "Attention Is All You Need" by Vaswani et al. asked a radical question: what if we got rid of recurrence entirely and used only attention?
The key innovation was self-attention: instead of the decoder attending to encoder states, every word attends to every other word in the same sequence.
| RNN/LSTM | Transformer |
|---|---|
| Processes words sequentially | Processes all words in parallel |
| Slow to train (can't parallelize across time) | Fast to train (massively parallelizable) |
| Information travels through hidden states | Every word can directly attend to every other |
| Struggles with very long sequences | Handles long sequences (up to context limit) |
| O(n) sequential operations | O(1) sequential operations, O(n²) attention |
10.6 Transformers: The Architecture That Changed Everything
The Transformer architecture has become the foundation of virtually all modern language AI. Let's understand its key components.
Self-Attention: The Core Mechanism
In self-attention, each word computes a query, key, and value:
# For each word, compute Q, K, V by linear projection
Q = X @ W_Q # Query: "What am I looking for?"
K = X @ W_K # Key: "What do I contain?"
V = X @ W_V # Value: "What do I contribute?"
# Attention scores: how much should each word attend to each other word?
scores = Q @ K.T / sqrt(d_k) # Scale by dimension
# Softmax to get probabilities
attention = softmax(scores)
# Output: weighted sum of values
output = attention @ V
Think of it like a library: Query = "I'm looking for books about cats", Key = "This shelf has books about animals", Value = "Here are the actual books." High query-key similarity means the value gets more weight.
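Here is the same computation end to end in numpy for a toy "sentence" of four random word vectors; the projection matrices are random, so this shows the mechanics of a single attention head rather than anything a trained model would do:
# Single-head self-attention in numpy (random inputs and weights, toy dimensions)
import numpy as np
rng = np.random.default_rng(0)
n_words, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n_words, d_model))     # one row per word
W_Q = rng.normal(size=(d_model, d_k)) * 0.1
W_K = rng.normal(size=(d_model, d_k)) * 0.1
W_V = rng.normal(size=(d_model, d_k)) * 0.1
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)             # (n_words, n_words) relevance scores
attention = np.exp(scores)
attention /= attention.sum(axis=1, keepdims=True)   # softmax over each row
output = attention @ V                      # each word becomes a weighted mix of values
print(attention.round(2))                   # rows sum to 1: who attends to whom
print(output.shape)                         # (4, 8)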
Multi-Head Attention
Instead of single attention, transformers use multiple attention heads that each learn different patterns:
- Head 1 might learn syntactic dependencies (subject-verb agreement)
- Head 2 might learn semantic relationships (pronoun resolution)
- Head 3 might learn positional patterns (what's nearby)
Positional Encoding
Since attention is permutation-invariant (order doesn't matter by default), transformers add positional encodings to indicate where each word is in the sequence:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
These sinusoidal functions allow the model to learn relative positions and generalize to sequence lengths not seen during training.
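A direct transcription of those formulas into numpy, at toy sizes; the resulting matrix is what gets added to the word embeddings before the first layer:
# Sinusoidal positional encodings, following the formulas above (toy sizes)
import numpy as np
def positional_encoding(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]         # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]      # index of each (sin, cos) dimension pair
    angle = pos / (10000 ** (2 * i / d_model))
    pe[:, 0::2] = np.sin(angle)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)               # odd dimensions: cosine
    return pe
pe = positional_encoding(max_len=6, d_model=8)
print(pe.round(2))   # one row per position, added to that position's embedding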
The Full Architecture
A transformer layer combines:
- Multi-head self-attention
- Layer normalization (stabilizes training)
- Feed-forward network (processes each position independently)
- Residual connections (allows gradients to flow)
Stack many such layers (12 for BERT-base, 96 for GPT-3), and you have a modern language model.
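Below is a compact numpy sketch of one such layer that combines those four pieces, using a single attention head, random weights, and a simplified layer norm; real implementations add multi-head attention, learned normalization parameters, dropout, and masking:
# One simplified transformer layer in numpy: attention + residuals + layer norm + FFN
import numpy as np
rng = np.random.default_rng(0)
n_words, d_model, d_ff = 4, 8, 32
def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)
def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
def self_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(x.shape[-1])) @ V
def transformer_layer(x, p):
    # 1) self-attention with a residual connection, then layer norm
    x = layer_norm(x + self_attention(x, p["W_Q"], p["W_K"], p["W_V"]))
    # 2) position-wise feed-forward network with a residual connection, then layer norm
    ff = np.maximum(0, x @ p["W_1"]) @ p["W_2"]     # ReLU MLP applied to each position
    return layer_norm(x + ff)
p = {name: rng.normal(size=shape) * 0.1 for name, shape in [
    ("W_Q", (d_model, d_model)), ("W_K", (d_model, d_model)), ("W_V", (d_model, d_model)),
    ("W_1", (d_model, d_ff)), ("W_2", (d_ff, d_model)),
]}
X = rng.normal(size=(n_words, d_model))   # stand-in embeddings + positional encodings
out = transformer_layer(X, p)             # stack many of these to get a full model
print(out.shape)                          # (4, 8): same shape in, same shape out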
The transformer's power comes from several factors: (1) Every word can directly attend to every other, enabling global reasoning; (2) All positions are processed in parallel, enabling massive scaling; (3) Deep stacking with residual connections allows complex computations; (4) The architecture is remarkably amenable to scaling—bigger models consistently perform better. This last property enabled the scaling revolution that produced GPT-3 and beyond.
10.7 The Modern Era: Scale, RLHF, and Beyond
The Scaling Hypothesis
A remarkable empirical finding: language model performance improves predictably with scale. The scaling laws documented by OpenAI and others show that test loss falls as a smooth power law as model parameters, training data, and compute grow.
This meant that if you wanted better models, you "just" needed to make them bigger and train them on more data with more compute. This insight drove the race from millions to billions to trillions of parameters.
Timeline of Pre-trained Models
- 2018: BERT and GPT-1 show that pre-training on raw text, then fine-tuning, beats task-specific models
- 2019: GPT-2 scales up open-ended generation; T5 reframes every task as text-to-text
- 2020: GPT-3 (175 billion parameters) demonstrates few-shot, in-context learning
- 2022: InstructGPT and ChatGPT add RLHF, turning raw language models into usable assistants
- 2023: GPT-4, Claude, and open-weight models such as LLaMA bring the technology into everyday use
RLHF: Teaching Models to Be Helpful
Pre-training creates models that predict text well, but not necessarily models that are useful or safe. Reinforcement Learning from Human Feedback bridges this gap:
- Supervised Fine-Tuning: Train on examples of good conversations
- Reward Model: Train a model to predict which responses humans prefer (see the sketch after this list)
- RL Optimization: Optimize the language model to generate responses the reward model rates highly
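The reward-model step above is typically trained with a pairwise preference loss: score the human-preferred response higher than the rejected one. Here is a minimal sketch of that objective, with plain numbers standing in for reward-model outputs:
# Pairwise preference loss for a reward model (Bradley-Terry style); numbers are toy
import numpy as np
def preference_loss(reward_chosen, reward_rejected):
    # loss = -log sigmoid(r_chosen - r_rejected): low when the preferred answer scores higher
    return -np.log(1 / (1 + np.exp(-(reward_chosen - reward_rejected))))
print(preference_loss(2.0, 0.5))   # ~0.20: model already ranks the responses correctly
print(preference_loss(0.5, 2.0))   # ~1.70: model ranks them the wrong way around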
RLHF is part of the broader "alignment" challenge: ensuring AI systems do what we actually want, not just what we literally said. A model optimized purely for engagement might learn to be manipulative. One optimized for helpfulness might overpromise. Getting this right is one of the central challenges in AI safety.
Emergent Capabilities
Large language models exhibit capabilities that weren't explicitly trained for and weren't present in smaller models:
- In-context learning: Learning new tasks from examples in the prompt
- Chain-of-thought reasoning: Solving problems step-by-step when prompted to "think aloud"
- Code generation: Writing functional programs from descriptions
- Multilingual transfer: Capabilities learned in one language appearing in others
Whether these are truly "emergent" or artifacts of measurement is debated, but the practical impact is undeniable.
10.8 Why This Matters for Economists
Understanding NLP history isn't just intellectual curiosity—it has direct implications for empirical research.
Text as Data in Economics
Modern NLP enables economists to analyze text at scale:
Policy Analysis
- Fed communications and market reactions
- Regulatory text and compliance costs
- Political speech and polarization
Financial Applications
- Earnings call sentiment analysis
- News-based volatility prediction
- 10-K filing analysis
Firm Behavior
- Patent text and innovation measurement
- Job postings and skill demand
- Product descriptions and market positioning
Social Science
- Survey open-ended responses
- Social media analysis
- Historical document analysis
Practical Implications for Research
- Simple classification (spam, sentiment): Traditional ML (logistic regression, random forests) often suffices and is more interpretable (see the baseline sketch after this list)
- Similarity/clustering: Word embeddings (Word2Vec, GloVe) or sentence embeddings (Sentence-BERT)
- Named entity recognition: Fine-tuned BERT models
- Text generation, summarization, Q&A: Large language models (GPT, Claude)
- Domain-specific tasks: Consider fine-tuning or domain-specific models (e.g., FinBERT for finance)
- Reproducibility: LLM outputs can vary; document versions, prompts, and settings carefully
- Bias: Models reflect training data biases; validate on your specific domain
- Cost: API costs can add up; design efficient pipelines
- Interpretability: Neural models are harder to interpret than traditional methods
- Data contamination: Test data may have been in training data for large models
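For the first case above, simple classification with traditional ML, a common baseline is TF-IDF features plus logistic regression. The sketch below assumes scikit-learn is available; the example texts and labels are placeholders for your own data:
# Baseline text classification: TF-IDF features + logistic regression (scikit-learn)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
texts = ["great earnings beat expectations", "profit warning and weak guidance",
         "record revenue growth this quarter", "losses widen as demand collapses"]
labels = ["positive", "negative", "positive", "negative"]
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["strong revenue growth"]))   # likely ['positive'] on this toy data
# The fitted coefficients remain inspectable, a real advantage for research transparency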
Key Papers Using NLP in Economics
- Gentzkow & Shapiro (2010): Media bias measurement using text
- Hansen & McMahon (2016): FOMC communication and forward guidance
- Baker, Bloom & Davis (2016): Economic policy uncertainty index
- Ash et al. (2023): Text as Data in Economics (survey)