10 History of NLP
The story of Natural Language Processing is a fascinating journey from ambitious dreams to unexpected breakthroughs. Understanding this history isn't just academic curiosity—it reveals why modern language models work the way they do, and helps you use them more effectively.
Learning Objectives
- Trace the evolution of NLP from hand-crafted rules to self-supervised learning
- Understand the key algorithmic breakthroughs and why they mattered
- Grasp the technical intuition behind word embeddings, attention, and transformers
- Connect historical developments to the capabilities and limitations of modern LLMs
- Appreciate the role of scale, data, and compute in the AI revolution
Table of Contents
- 10.0 The Big Picture: A 70-Year Journey
- 10.1 The Symbolic Era (1950s-1980s)
- 10.2 The Statistical Revolution (1990s-2000s)
- 10.3 Word Embeddings: Meaning as Geometry (2013)
- 10.4 Sequence Models: Learning to Remember
- 10.5 Attention: The Breakthrough (2014-2017)
- 10.6 Transformers: The Architecture That Changed Everything
- 10.7 The Modern Era: Scale, RLHF, and Beyond
- 10.8 Why This Matters for Economists
- References & Further Reading
10.0 The Big Picture: A 70-Year Journey
Before diving into details, let's see the forest before the trees. NLP has gone through four major paradigms, each representing a fundamentally different philosophy about how machines should process language:
Era 1: Symbolic (1950s-1980s)
- Philosophy: Language follows rules; encode them
- Method: Hand-written grammars, pattern matching
- Bottleneck: Rules can't capture all of language
- Legacy: ELIZA, expert systems
Era 2: Statistical (1990s-2000s)
- Philosophy: Language is probabilistic; learn from data
- Method: N-grams, HMMs, Naive Bayes, SVMs
- Bottleneck: Feature engineering is manual
- Legacy: Spam filters, early MT
Era 3: Neural (2013-2017)
- Philosophy: Learn representations automatically
- Method: Word2Vec, RNNs, LSTMs, Seq2Seq
- Bottleneck: Sequential processing is slow
- Legacy: Word embeddings, neural MT
Era 4: Transformer (2017-present)
- Philosophy: Attention + scale = emergence
- Method: Self-attention, pre-training, RLHF
- Bottleneck: Compute, alignment, hallucination
- Legacy: BERT, GPT, Claude, and the AI revolution
Each paradigm shift wasn't just incremental improvement—it was a fundamental rethinking of the problem. The move from rules to statistics said "let data speak." The move from statistics to neural networks said "let the model find its own features." The move to transformers said "let every word attend to every other word, in parallel." Understanding these shifts helps you understand what modern models can and cannot do.
10.1 The Symbolic Era (1950s-1980s)
The Dream of Machine Intelligence
The story begins with Alan Turing's 1950 paper "Computing Machinery and Intelligence", which posed the question that still drives AI research: Can machines think?
Turing proposed what we now call the Turing Test: a machine passes if a human interrogator, communicating only through text, cannot reliably distinguish it from a human. This framing was remarkably prescient—it anticipated that language would be central to machine intelligence.
"I propose to consider the question, 'Can machines think?' This should begin with definitions of the meaning of the terms 'machine' and 'think.'"
— Alan Turing, Computing Machinery and Intelligence (1950)
ELIZA: The First Chatbot (1966)
Joseph Weizenbaum at MIT created ELIZA, one of the first programs to engage in human-like conversation. ELIZA simulated a Rogerian psychotherapist using simple pattern matching and substitution rules.
# A typical ELIZA conversation
User: I am feeling sad today.
ELIZA: I am sorry to hear you are feeling sad today.
User: My mother doesn't understand me.
ELIZA: Tell me more about your family.
User: I think you don't really understand me either.
ELIZA: What makes you think I don't really understand you?
ELIZA worked by matching keywords ("mother" → family topic) and transforming sentences with templates. It had no understanding whatsoever—just clever rules. Yet users found it surprisingly engaging, a phenomenon Weizenbaum called the "ELIZA effect."
People have a strong tendency to attribute understanding and emotion to systems that use language, even very simple ones. This psychological phenomenon is crucial for understanding public reactions to modern chatbots—and for being appropriately skeptical about what "understanding" really means.
How ELIZA Actually Worked
Let's peek under the hood. ELIZA's "intelligence" was a set of pattern-matching rules:
# Simplified ELIZA-style rules in Python
import re
rules = [
# (pattern, response_template)
(r"I am (.*)", "Why do you say you are {0}?"),
(r"I feel (.*)", "Tell me more about feeling {0}."),
(r"(.*) mother (.*)", "Tell me more about your family."),
(r"(.*) father (.*)", "How do you feel about your father?"),
(r"I think (.*)", "What makes you think {0}?"),
(r"(.*)", "Please go on."), # fallback
]
def eliza_respond(user_input):
for pattern, response in rules:
match = re.match(pattern, user_input, re.IGNORECASE)
if match:
return response.format(*match.groups())
return "I see. Please continue."
The Chomsky Paradigm
Meanwhile, Noam Chomsky's work dominated academic linguistics. His theory of transformational grammar proposed that language was governed by innate, universal rules—and that statistical approaches were fundamentally misguided.
"It must be recognized that the notion 'probability of a sentence' is an entirely useless one."
— Noam Chomsky, Syntactic Structures (1957)
Chomsky illustrated this with the famous example "Colorless green ideas sleep furiously": the sentence is grammatically well-formed but meaningless, and had never appeared in any corpus, so a purely statistical model would assign it essentially zero probability. Yet we immediately recognize it as grammatical. He argued this showed that language could not be reduced to statistics.
Chomsky was half-right. Human language does have deep structure that pure statistics might miss. But he underestimated what statistical methods could achieve with enough data and the right architectures. The irony is that modern neural networks have learned grammatical structure purely from statistics—they can distinguish grammatical from ungrammatical sentences, even weird ones, without being told any rules.
The ALPAC Report and the First AI Winter (1966)
The Automatic Language Processing Advisory Committee (ALPAC) issued a devastating report concluding that machine translation was too expensive and of too poor quality to be practical. Government funding for NLP research collapsed, leading to what's called the first "AI Winter."
The report was correct about the limitations of rule-based approaches. It was wrong about the ultimate possibility of machine translation—it just required a completely different approach.
10.2 The Statistical Revolution (1990s-2000s)
The statistical revolution came when researchers stopped trying to encode linguistic rules and started treating language as data to be modeled probabilistically. This paradigm shift was driven by researchers at IBM working on speech recognition.
"Every time I fire a linguist, the performance of the speech recognizer goes up."
— Fred Jelinek, IBM (apocryphal, but captures the spirit)
The Core Idea: Language Modeling
A language model assigns probabilities to sequences of words. The fundamental question: given some words, what word is likely to come next?
This might seem like a detour from "understanding" language, but it turns out that predicting the next word requires capturing a tremendous amount of linguistic knowledge.
N-gram Models: The Markov Assumption
Estimating the full conditional probability P(word | all previous words) is hopeless in practice: almost every long history of words is unique, so there is no data to estimate from. The Markov assumption simplifies this: condition only on the last n-1 words.
# N-gram probability estimation
# Unigram (n=1): Each word is independent
P("cat") = count("cat") / total_words
# Bigram (n=2): Depends on previous word only
P("cat" | "the") = count("the cat") / count("the")
# Trigram (n=3): Depends on two previous words
P("mat" | "on the") = count("on the mat") / count("on the")
# Example with a toy corpus: "the cat sat on the mat"
# Bigram counts: "the cat":1, "cat sat":1, "sat on":1, "on the":1, "the mat":1
P("cat" | "the") = 1/2 = 0.5 # "the" appears twice, followed by "cat" once
P("mat" | "the") = 1/2 = 0.5 # "the" appears twice, followed by "mat" once
P("dog" | "the") = 0/2 = 0.0 # Never seen "the dog" - assigns zero!
The Sparsity Problem
N-gram models suffer from sparsity: most possible n-grams never appear in training data. If you've never seen "the dog" in your corpus, is it impossible? Of course not!
Various "smoothing" methods were developed to handle unseen n-grams: add-one smoothing (Laplace), Good-Turing estimation, Kneser-Ney smoothing. These assign small but non-zero probabilities to unseen events. Kneser-Ney, in particular, remained state-of-the-art for language modeling well into the neural era.
Bag of Words and TF-IDF
For document classification tasks (spam detection, sentiment analysis, topic categorization), simpler representations often worked well:
Bag of Words
- Represent document as word counts
- Ignores word order entirely
- "The cat sat on the mat" = {the:2, cat:1, sat:1, on:1, mat:1}
- Simple but surprisingly effective
TF-IDF
- Term Frequency: How often word appears in document
- Inverse Document Frequency: Penalize common words
- TF-IDF = TF × log(N/df)
- Still used in search engines today
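A small sketch of both representations using scikit-learn, assuming it is installed; the two toy documents are made up, and note that scikit-learn's TF-IDF uses a slightly smoothed variant of the formula above:
# Bag of words vs. TF-IDF on two toy documents (requires scikit-learn)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ["the cat sat on the mat", "the dog chased the cat"]
bow = CountVectorizer()
counts = bow.fit_transform(docs)       # sparse matrix of raw word counts
print(bow.get_feature_names_out())     # the vocabulary (alphabetical)
print(counts.toarray())                # word order is ignored entirely
tfidf = TfidfVectorizer()              # smoothed IDF variant of TF x log(N/df)
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))      # frequent words like "the" are downweighted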
Hidden Markov Models (HMMs)
For sequence labeling tasks like part-of-speech tagging (assigning noun, verb, adjective to each word), Hidden Markov Models became the dominant approach. HMMs model sequences where the underlying "state" (the part of speech) is hidden, and we only observe the words.
# Given a sentence, find the most likely sequence of tags
sentence = ["The", "cat", "sat", "quickly"]
# HMM estimates two types of probabilities:
# 1. Transition probabilities: P(tag_i | tag_{i-1})
P(NOUN | DET) = 0.6 # After "the", nouns are common
P(VERB | NOUN) = 0.4 # After a noun, verbs are common
# 2. Emission probabilities: P(word | tag)
P("cat" | NOUN) = 0.01 # "cat" is a noun
P("sat" | VERB) = 0.005 # "sat" is a verb
# Viterbi algorithm finds: DET → NOUN → VERB → ADV
tags = ["DET", "NOUN", "VERB", "ADV"]
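For readers who want to see the decoding step itself, here is a minimal Viterbi implementation for that sentence; every probability below is made up for illustration rather than estimated from a real tagged corpus:
# Minimal Viterbi decoding with made-up HMM probabilities (illustrative only)
tags = ["DET", "NOUN", "VERB", "ADV"]
start = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.05, "ADV": 0.05}
trans = {  # P(tag_i | tag_{i-1})
    "DET":  {"DET": 0.01, "NOUN": 0.9,  "VERB": 0.05, "ADV": 0.04},
    "NOUN": {"DET": 0.05, "NOUN": 0.1,  "VERB": 0.7,  "ADV": 0.15},
    "VERB": {"DET": 0.3,  "NOUN": 0.2,  "VERB": 0.05, "ADV": 0.45},
    "ADV":  {"DET": 0.2,  "NOUN": 0.2,  "VERB": 0.4,  "ADV": 0.2},
}
emit = {  # P(word | tag); words not listed get a tiny floor probability
    "DET": {"the": 0.7}, "NOUN": {"cat": 0.05},
    "VERB": {"sat": 0.02}, "ADV": {"quickly": 0.03},
}
def viterbi(words):
    # best[t][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (start[t] * emit[t].get(words[0], 1e-6), None) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            p, prev = max(
                (best[-1][pt][0] * trans[pt][t] * emit[t].get(w, 1e-6), pt)
                for pt in tags
            )
            col[t] = (p, prev)
        best.append(col)
    path = [max(tags, key=lambda t: best[-1][t][0])]   # most probable final tag
    for col in reversed(best[1:]):                     # follow the back-pointers
        path.append(col[path[-1]][1])
    return list(reversed(path))
print(viterbi([w.lower() for w in ["The", "cat", "sat", "quickly"]]))
# ['DET', 'NOUN', 'VERB', 'ADV']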
The IBM Translation Models (1990s)
IBM researchers pioneered statistical machine translation by learning to align words between languages from parallel corpora (texts in two languages that are translations of each other).
The key insight: translation could be decomposed into a language model (how likely is this English sentence?) and a translation model (how likely is this French sentence as a translation of that English?). Using Bayes' theorem to translate French F into English E:
best_E = argmax over E of P(E | F) = argmax over E of P(F | E) * P(E)
This approach dominated machine translation for nearly two decades, culminating in systems like Google Translate (pre-neural version).
10.3 Word Embeddings: Meaning as Geometry (2013)
The breakthrough that bridged statistical and neural NLP came from a deceptively simple idea: represent words as dense vectors in a continuous space, where similar words are close together.
The Distributional Hypothesis
The theoretical foundation comes from linguistics:
"You shall know a word by the company it keeps."
— J.R. Firth (1957)
Words that appear in similar contexts have similar meanings. "Dog" and "cat" appear near words like "pet," "fur," "veterinarian." "King" and "queen" appear near "royal," "throne," "crown."
Word2Vec: The Revolution (2013)
Tomas Mikolov and colleagues at Google introduced Word2Vec, which trained a shallow neural network to predict words from their surrounding context (or the context from a word). The learned weights of that network became the word embeddings.
Skip-gram
- Given center word, predict context
- Input: "cat"
- Predict: "the", "sat", "on", "mat"
- Better for rare words
CBOW (Continuous Bag of Words)
- Given context, predict center word
- Input: "the", "sat", "on", "mat"
- Predict: "cat"
- Faster to train
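If you want to train embeddings yourself, the gensim library implements both variants; the sketch below assumes gensim is installed and uses a throwaway toy corpus, so the resulting vectors are not meaningful (real embeddings need very large corpora):
# Training skip-gram embeddings with gensim (toy corpus, illustrative only)
from gensim.models import Word2Vec
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]
model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embeddings
    window=2,         # context window size
    min_count=1,      # keep every word (only sensible for a toy corpus)
    sg=1,             # 1 = skip-gram, 0 = CBOW
)
print(model.wv["cat"][:5])           # first few dimensions of the "cat" vector
print(model.wv.most_similar("cat"))  # nearest neighbors (noisy on a toy corpus)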
The Magic of Vector Arithmetic
Word2Vec's most famous result was that semantic relationships became vector operations:
# These relationships emerged automatically from text!
# Gender relationship
vector("king") - vector("man") + vector("woman") โ vector("queen")
# Capital cities
vector("paris") - vector("france") + vector("italy") โ vector("rome")
# Verb tense
vector("walking") - vector("walk") + vector("swim") โ vector("swimming")
# Superlatives
vector("biggest") - vector("big") + vector("small") โ vector("smallest")
The model learned these relationships purely from word co-occurrence patterns—no explicit knowledge was provided!
Why does this work? The direction from "man" to "woman" captures the concept of "gender change." The direction from "France" to "Paris" captures "capital of." When you add and subtract these directions, you're combining concepts algebraically. It's not perfect, but it's remarkable that it works at all—and that it emerged automatically from prediction.
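Mechanically, an analogy query is just vector arithmetic followed by a nearest-neighbor search under cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors, not real embeddings, purely to show the computation:
# "king - man + woman ~ queen" as nearest-neighbor search (hand-made toy vectors)
import numpy as np
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.8, 0.0]),
    "woman": np.array([0.1, 0.2, 0.0]),
    "apple": np.array([0.0, 0.1, 0.9]),
}
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
target = emb["king"] - emb["man"] + emb["woman"]
# Rank every word except the query words by similarity to the target vector
scores = {w: cosine(target, v) for w, v in emb.items() if w not in {"king", "man", "woman"}}
print(max(scores, key=scores.get))  # "queen"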
GloVe: Combining Count-Based and Neural Methods (2014)
Stanford's GloVe (Global Vectors) combined the insights of count-based methods (which work from global co-occurrence counts) with neural embedding learning. Instead of predicting context words one at a time, GloVe fit word vectors directly to co-occurrence statistics aggregated over the entire corpus.
The Limitation: One Vector Per Word
Word2Vec and GloVe assign exactly one vector to each word. But words have multiple meanings:
- "bank": financial institution vs. river bank
- "cell": biological cell vs. prison cell vs. phone
- "right": correct vs. opposite of left vs. political leaning
These static embeddings average all meanings together. Solving this required models that could understand context—which led to the next breakthrough.
10.4 Sequence Models: Learning to Remember
To capture context, we need models that process sequences of words and maintain a "memory" of what they've seen. Enter Recurrent Neural Networks (RNNs).
The RNN Architecture
An RNN processes one word at a time, maintaining a hidden state that gets updated at each step:
# Processing "The cat sat on the mat"
h_0 = [0, 0, 0, ...] # Initial hidden state (zeros)
# Step 1: Process "The"
h_1 = tanh(W_h @ h_0 + W_x @ embed("The") + b)
# Step 2: Process "cat" - h_1 carries info about "The"
h_2 = tanh(W_h @ h_1 + W_x @ embed("cat") + b)
# Step 3: Process "sat" - h_2 carries info about "The cat"
h_3 = tanh(W_h @ h_2 + W_x @ embed("sat") + b)
# ... and so on
# The hidden state h_t encodes context from all previous words
# But information from early words gets "diluted" over time
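The same update rule as runnable numpy, with random weights and random stand-in embeddings; this demonstrates the mechanics of the recurrence, not learned behavior:
# One-layer RNN forward pass in numpy (random weights, purely mechanical demo)
import numpy as np
rng = np.random.default_rng(0)
d_hidden, d_embed = 8, 4
W_h = rng.normal(size=(d_hidden, d_hidden)) * 0.1
W_x = rng.normal(size=(d_hidden, d_embed)) * 0.1
b = np.zeros(d_hidden)
words = ["The", "cat", "sat", "on", "the", "mat"]
embed = {w: rng.normal(size=d_embed) for w in words}   # stand-in embeddings
h = np.zeros(d_hidden)                                 # h_0
for w in words:
    h = np.tanh(W_h @ h + W_x @ embed[w] + b)          # h_t depends on h_{t-1} and the word
print(h.round(3))   # the final hidden state summarizes the whole (toy) sequence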
The Vanishing Gradient Problem
Standard RNNs have a critical flaw: during training, gradients either vanish (shrink to nearly zero) or explode (grow uncontrollably) when backpropagating through many time steps. This means:
- The network struggles to learn long-range dependencies
- Information from early words gets lost
- In "The cat that the dog chased ran away," connecting "cat" to "ran" is hard
LSTMs: Learning What to Remember (1997)
Long Short-Term Memory networks, invented by Sepp Hochreiter and Jürgen Schmidhuber, solved this with a clever architecture using gates that control information flow:
Forget Gate
Decides what information from the previous cell state to discard. "Should I still remember that the subject was 'cat'?"
Input Gate
Decides what new information to store. "Is this word important enough to remember?"
Output Gate
Decides what to output based on the cell state. "What's relevant for the current prediction?"
The gates allow the network to maintain information over long distances. The key is the cell state—a "highway" that information can flow along unchanged. The forget gate can be set to nearly 1.0, meaning "keep everything," allowing gradients to flow unimpeded. This architectural innovation was crucial, and the idea of an unobstructed highway for information and gradients reappears in the residual connections of transformers.
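To see how the three gates fit together, here is a single LSTM step in numpy with random weights; it follows the standard gate equations but is a toy, not a trained model:
# One LSTM step in numpy: forget, input, and output gates (random weights, toy sizes)
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
rng = np.random.default_rng(0)
d_h, d_x = 8, 4
# One weight matrix and bias per gate, plus one for the candidate cell content
W = {g: rng.normal(size=(d_h, d_h + d_x)) * 0.1 for g in ["f", "i", "o", "c"]}
b = {g: np.zeros(d_h) for g in ["f", "i", "o", "c"]}
def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])           # previous hidden state + new input
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate: what to keep from c_prev
    i = sigmoid(W["i"] @ z + b["i"])          # input gate: how much new info to write
    o = sigmoid(W["o"] @ z + b["o"])          # output gate: what to expose as h
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate new content
    c = f * c_prev + i * c_tilde              # the cell-state "highway"
    h = o * np.tanh(c)
    return h, c
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(6, d_x)):           # six toy "word vectors"
    h, c = lstm_step(x, h, c)
print(h.round(3))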
Sequence-to-Sequence Models (2014)
For tasks like translation, where input and output lengths differ, Sutskever et al. introduced the encoder-decoder architecture:
# Translating "The cat sat" → "Le chat s'assit"
# ENCODER: Process input and compress to a single vector
h_1 = LSTM("The", h_0)
h_2 = LSTM("cat", h_1)
h_3 = LSTM("sat", h_2)
context = h_3 # This single vector must encode the entire input!
# DECODER: Generate output one word at a time
s_1 = LSTM("<START>", context) → "Le"
s_2 = LSTM("Le", s_1) → "chat"
s_3 = LSTM("chat", s_2) → "s'assit"
s_4 = LSTM("s'assit", s_3) → "<END>"
This was the state of the art for machine translation in 2014-2016. But there was a fundamental problem: the entire input had to be compressed into a single fixed-size vector. For long sentences, this bottleneck caused severe information loss.
10.5 Attention: The Breakthrough (2014-2017)
The attention mechanism, introduced by Bahdanau et al. (2015), solved the bottleneck problem with an elegant idea: instead of forcing everything through one vector, let the decoder look back at all encoder states and decide which ones are relevant at each step.
How Attention Works
At each decoding step, attention computes a weighted sum of all encoder hidden states. The weights reflect "how relevant is each input word for generating this output word?"
# Generating the French word "chat" (cat)
# Encoder states from "The cat sat"
encoder_states = [h_the, h_cat, h_sat]
# Current decoder state
decoder_state = s_1
# Compute attention scores (how relevant is each encoder state?)
scores = [dot(s_1, h_the), # 0.1 - "The" not very relevant
dot(s_1, h_cat), # 0.8 - "cat" highly relevant!
dot(s_1, h_sat)] # 0.1 - "sat" not very relevant
# Convert to probabilities with softmax
attention_weights = softmax(scores) # [0.1, 0.8, 0.1]
# Weighted sum of encoder states
context = 0.1*h_the + 0.8*h_cat + 0.1*h_sat
# Use context to help predict "chat"
output = predict(decoder_state, context) # โ "chat"
Attention learns which input words "align" with which output words. When translating "chat," the model focuses on "cat." When generating a verb, it attends to the input verb. This alignment emerges automatically from training—no one told the model which words correspond to which.
The Transformer Insight (2017)
The landmark paper "Attention Is All You Need" by Vaswani et al. asked a radical question: what if we got rid of recurrence entirely and used only attention?
The key innovation was self-attention: instead of the decoder attending to encoder states, every word attends to every other word in the same sequence.
| RNN/LSTM | Transformer |
|---|---|
| Processes words sequentially | Processes all words in parallel |
| Slow to train (can't parallelize across time) | Fast to train (massively parallelizable) |
| Information travels through hidden states | Every word can directly attend to every other |
| Struggles with very long sequences | Handles long sequences (up to context limit) |
| O(n) sequential operations | O(1) sequential operations, O(n²) attention |
10.6 Transformers: The Architecture That Changed Everything
The Transformer architecture has become the foundation of virtually all modern language AI. Let's understand its key components.
Self-Attention: The Core Mechanism
In self-attention, each word computes a query, key, and value:
# For each word, compute Q, K, V by linear projection
Q = X @ W_Q # Query: "What am I looking for?"
K = X @ W_K # Key: "What do I contain?"
V = X @ W_V # Value: "What do I contribute?"
# Attention scores: how much should each word attend to each other word?
scores = Q @ K.T / sqrt(d_k) # Scale by dimension
# Softmax to get probabilities
attention = softmax(scores)
# Output: weighted sum of values
output = attention @ V
Think of it like a library: Query = "I'm looking for books about cats", Key = "This shelf has books about animals", Value = "Here are the actual books." High query-key similarity means the value gets more weight.
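Here is the same computation end to end in numpy for a toy "sentence" of four random word vectors; the projection matrices are random, so this shows the mechanics of a single attention head rather than anything a trained model would do:
# Single-head self-attention in numpy (random inputs and weights, toy dimensions)
import numpy as np
rng = np.random.default_rng(0)
n_words, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n_words, d_model))     # one row per word
W_Q = rng.normal(size=(d_model, d_k)) * 0.1
W_K = rng.normal(size=(d_model, d_k)) * 0.1
W_V = rng.normal(size=(d_model, d_k)) * 0.1
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)             # (n_words, n_words) relevance scores
attention = np.exp(scores)
attention /= attention.sum(axis=1, keepdims=True)   # softmax over each row
output = attention @ V                      # each word becomes a weighted mix of values
print(attention.round(2))                   # rows sum to 1: who attends to whom
print(output.shape)                         # (4, 8)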
Multi-Head Attention
Instead of single attention, transformers use multiple attention heads that each learn different patterns:
- Head 1 might learn syntactic dependencies (subject-verb agreement)
- Head 2 might learn semantic relationships (pronoun resolution)
- Head 3 might learn positional patterns (what's nearby)
Positional Encoding
Since attention is permutation-invariant (order doesn't matter by default), transformers add positional encodings to indicate where each word is in the sequence:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
These sinusoidal functions allow the model to learn relative positions and generalize to sequence lengths not seen during training.
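A direct transcription of those formulas into numpy, at toy sizes; the resulting matrix is what gets added to the word embeddings before the first layer:
# Sinusoidal positional encodings, following the formulas above (toy sizes)
import numpy as np
def positional_encoding(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]         # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]      # index of each (sin, cos) dimension pair
    angle = pos / (10000 ** (2 * i / d_model))
    pe[:, 0::2] = np.sin(angle)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)               # odd dimensions: cosine
    return pe
pe = positional_encoding(max_len=6, d_model=8)
print(pe.round(2))   # one row per position, added to that position's embedding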
The Full Architecture
A transformer layer combines:
- Multi-head self-attention
- Layer normalization (stabilizes training)
- Feed-forward network (processes each position independently)
- Residual connections (allows gradients to flow)
Stack many such layers (12 for BERT-base, 96 for GPT-3), and you have a modern language model.
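Below is a compact numpy sketch of one such layer that combines those four pieces, using a single attention head, random weights, and a simplified layer norm; real implementations add multi-head attention, learned normalization parameters, dropout, and masking:
# One simplified transformer layer in numpy: attention + residuals + layer norm + FFN
import numpy as np
rng = np.random.default_rng(0)
n_words, d_model, d_ff = 4, 8, 32
def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)
def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
def self_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(x.shape[-1])) @ V
def transformer_layer(x, p):
    # 1) self-attention with a residual connection, then layer norm
    x = layer_norm(x + self_attention(x, p["W_Q"], p["W_K"], p["W_V"]))
    # 2) position-wise feed-forward network with a residual connection, then layer norm
    ff = np.maximum(0, x @ p["W_1"]) @ p["W_2"]     # ReLU MLP applied to each position
    return layer_norm(x + ff)
p = {name: rng.normal(size=shape) * 0.1 for name, shape in [
    ("W_Q", (d_model, d_model)), ("W_K", (d_model, d_model)), ("W_V", (d_model, d_model)),
    ("W_1", (d_model, d_ff)), ("W_2", (d_ff, d_model)),
]}
X = rng.normal(size=(n_words, d_model))   # stand-in embeddings + positional encodings
out = transformer_layer(X, p)             # stack many of these to get a full model
print(out.shape)                          # (4, 8): same shape in, same shape out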
The transformer's power comes from several factors: (1) Every word can directly attend to every other, enabling global reasoning; (2) All positions are processed in parallel, enabling massive scaling; (3) Deep stacking with residual connections allows complex computations; (4) The architecture is remarkably amenable to scaling—bigger models consistently perform better. This last property enabled the scaling revolution that produced GPT-3 and beyond.
10.7 The Modern Era: Scale, RLHF, and Beyond
The Scaling Hypothesis
A remarkable empirical finding: language model performance improves predictably with scale. The scaling laws documented by OpenAI and others show that test loss falls as a smooth power law as model parameters, training data, and compute grow.
This meant that if you wanted better models, you "just" needed to make them bigger and train them on more data with more compute. This insight drove the race from millions to billions to trillions of parameters.
Timeline of Pre-trained Models
- 2018: BERT and GPT-1 show that pre-training on raw text, then fine-tuning, beats task-specific models
- 2019: GPT-2 scales up open-ended generation; T5 reframes every task as text-to-text
- 2020: GPT-3 (175 billion parameters) demonstrates few-shot, in-context learning
- 2022: InstructGPT and ChatGPT add RLHF, turning raw language models into usable assistants
- 2023: GPT-4, Claude, and open-weight models such as LLaMA bring the technology into everyday use
RLHF: Teaching Models to Be Helpful
Pre-training creates models that predict text well, but not necessarily models that are useful or safe. Reinforcement Learning from Human Feedback bridges this gap:
- Supervised Fine-Tuning: Train on examples of good conversations
- Reward Model: Train a model to predict which responses humans prefer (see the sketch after this list)
- RL Optimization: Optimize the language model to generate responses the reward model rates highly
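The reward-model step above is typically trained with a pairwise preference loss: score the human-preferred response higher than the rejected one. Here is a minimal sketch of that objective, with plain numbers standing in for reward-model outputs:
# Pairwise preference loss for a reward model (Bradley-Terry style); numbers are toy
import numpy as np
def preference_loss(reward_chosen, reward_rejected):
    # loss = -log sigmoid(r_chosen - r_rejected): low when the preferred answer scores higher
    return -np.log(1 / (1 + np.exp(-(reward_chosen - reward_rejected))))
print(preference_loss(2.0, 0.5))   # ~0.20: model already ranks the responses correctly
print(preference_loss(0.5, 2.0))   # ~1.70: model ranks them the wrong way around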
RLHF is part of the broader "alignment" challenge: ensuring AI systems do what we actually want, not just what we literally said. A model optimized purely for engagement might learn to be manipulative. One optimized for helpfulness might overpromise. Getting this right is one of the central challenges in AI safety.
Emergent Capabilities
Large language models exhibit capabilities that weren't explicitly trained for and weren't present in smaller models:
- In-context learning: Learning new tasks from examples in the prompt
- Chain-of-thought reasoning: Solving problems step-by-step when prompted to "think aloud"
- Code generation: Writing functional programs from descriptions
- Multilingual transfer: Capabilities learned in one language appearing in others
Whether these are truly "emergent" or artifacts of measurement is debated, but the practical impact is undeniable.
10.8 Why This Matters for Economists
Understanding NLP history isn't just intellectual curiosity—it has direct implications for empirical research.
Text as Data in Economics
Modern NLP enables economists to analyze text at scale:
Policy Analysis
- Fed communications and market reactions
- Regulatory text and compliance costs
- Political speech and polarization
Financial Applications
- Earnings call sentiment analysis
- News-based volatility prediction
- 10-K filing analysis
Firm Behavior
- Patent text and innovation measurement
- Job postings and skill demand
- Product descriptions and market positioning
Social Science
- Survey open-ended responses
- Social media analysis
- Historical document analysis
Practical Implications for Research
- Simple classification (spam, sentiment): Traditional ML (logistic regression, random forests) often suffices and is more interpretable (see the baseline sketch after this list)
- Similarity/clustering: Word embeddings (Word2Vec, GloVe) or sentence embeddings (Sentence-BERT)
- Named entity recognition: Fine-tuned BERT models
- Text generation, summarization, Q&A: Large language models (GPT, Claude)
- Domain-specific tasks: Consider fine-tuning or domain-specific models (e.g., FinBERT for finance)
- Reproducibility: LLM outputs can vary; document versions, prompts, and settings carefully
- Bias: Models reflect training data biases; validate on your specific domain
- Cost: API costs can add up; design efficient pipelines
- Interpretability: Neural models are harder to interpret than traditional methods
- Data contamination: Test data may have been in training data for large models
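For the first case above, simple classification with traditional ML, a common baseline is TF-IDF features plus logistic regression. The sketch below assumes scikit-learn is available; the example texts and labels are placeholders for your own data:
# Baseline text classification: TF-IDF features + logistic regression (scikit-learn)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
texts = ["great earnings beat expectations", "profit warning and weak guidance",
         "record revenue growth this quarter", "losses widen as demand collapses"]
labels = ["positive", "negative", "positive", "negative"]
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["strong revenue growth"]))   # likely ['positive'] on this toy data
# The fitted coefficients remain inspectable, a real advantage for research transparency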
Key Papers Using NLP in Economics
- Gentzkow & Shapiro (2010): Media bias measurement using text
- Hansen & McMahon (2016): FOMC communication and forward guidance
- Baker, Bloom & Davis (2016): Economic policy uncertainty index
- Ash et al. (2023): Text as Data in Economics (survey)