In this blog, we'll explore the foundations of n-gram models, how they work, their strengths and limitations, and how they compare to more sophisticated approaches. Understanding these classical techniques provides valuable insights into the evolution of language modeling and why more advanced models were developed.
For example, given the sentence:
The quick brown fox jumps over the lazy ____
We might predict the next word as "dog," "cat," or "river" based on the context, but not function words like "the" or "this."
A language model is a machine learning model that predicts the next word in a sequence and assigns probabilities to each possible word. One of the simplest types of language models is the n-gram model.
An n-gram is a sequence of n words:
- A 1-gram (unigram) is a single word.
- A 2-gram (bigram) is a sequence of two words.
- A 3-gram (trigram) is a sequence of three words, and so on.
When we refer to an n-gram model, we mean a model that predicts the probability of a word based only on the previous n−1 words.
How Do N-gram Models Work?
To understand n-gram models, let's begin with the fundamental task of computing the probability of a word $w$ given a history $h$:

$$P(w \mid h) = \frac{C(h\,w)}{C(h)}$$

where:
- $C(h\,w)$ is the count of occurrences of the sequence $h\,w$ in a corpus.
- $C(h)$ is the count of the history $h$ appearing in the corpus.
For example, given the sentence:
The quick brown fox jumps over the lazy ____
we want to compute the probability of the word "dog" appearing next:

$$P(\text{dog} \mid \text{The quick brown fox jumps over the lazy})$$
If we had a large enough dataset, we could estimate these probabilities by counting occurrences. However, natural language is highly diverse, and many word combinations might rarely appear in training data, making it difficult to estimate probabilities directly. To address this, we need a more practical way to approximate probabilities.
The Markov Assumption and N-gram Approximation
This is where n-gram models come in. They simplify the problem by making a key assumption, the Markov assumption: the probability of a word depends only on the previous n−1 words rather than on the entire history.
This assumption drastically reduces the complexity of computing probabilities, making it feasible to estimate them using limited data.
For example, instead of:

$$P(\text{dog} \mid \text{The quick brown fox jumps over the lazy})$$

we approximate it as:

$$P(\text{dog} \mid \text{lazy})$$

if we use a bigram model (n=2), meaning we consider only the last word. Similarly, for a trigram model (n=3), we would consider the last two words:

$$P(\text{dog} \mid \text{the lazy})$$

Thus, in general, an n-gram model estimates probabilities as:

$$P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-n+1} \ldots w_{i-1})$$

where:
- Bigram model (n=2): $P(w_i \mid w_{i-1})$
- Trigram model (n=3): $P(w_i \mid w_{i-2}, w_{i-1})$
By using this approximation, n-gram models allow us to compute word probabilities efficiently while capturing local word dependencies.
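To make the approximation concrete, here is a minimal sketch (the helper markov_history is a hypothetical name, not from any library) showing which history an n-gram model actually conditions on when predicting a word:

def markov_history(tokens, i, n):
    # The full history would be tokens[:i]; an n-gram model keeps only its last n-1 words.
    return tokens[max(0, i - (n - 1)):i]

tokens = "the quick brown fox jumps over the lazy dog".split()
i = len(tokens) - 1                    # position of the word we want to predict ("dog")
markov_history(tokens, i, 2)           # bigram:  ['lazy']
markov_history(tokens, i, 3)           # trigram: ['the', 'lazy']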
Estimating N-gram Probabilities
Given the Markov assumption, we can estimate n-gram probabilities using maximum likelihood estimation (MLE).
We get MLE estimates by counting occurrences of n-grams in a corpus and normalizing them to obtain probabilities. For example, to estimate bigram probabilities, we count the occurrences of each word pair and divide by the count of the preceding word:

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})}$$

where:
- $C(w_{i-1}\,w_i)$ is the count of the bigram $(w_{i-1}, w_i)$.
- $C(w_{i-1})$ is the count of the history $w_{i-1}$.
Practical Example: Estimating a Bigram Probability
Let's go through a practical example using the short story The Verdict, available from Wikisource, the free online library.
Loading the Corpus
import re
from collections import Counter
from nltk.util import ngrams
with open("the-verdict.txt", "r") as f:
    data = f.read()
words = data.replace("--", " ").lower().split()
len(words)
3731
Our corpus contains a total of 3731 words. Now, let's implement a bigram model.
Extracting Bigrams
We first extract all possible bigrams from the text.
bigrams = list(ngrams(words, 2))
bigrams[:10]
[('i', 'had'),
('had', 'always'),
('always', 'thought'),
('thought', 'jack'),
('jack', 'gisburn'),
('gisburn', 'rather'),
('rather', 'a'),
('a', 'cheap'),
('cheap', 'genius'),
('genius', 'though')]
Now, let's say we want to compute the probability of "gisburn" appearing after "jack", i.e. $P(\text{gisburn} \mid \text{jack})$.
Step 1: Count All Bigrams Starting with 'Jack'
We extract all bigrams where "jack" is the first word to find possible next words.
jack_bigrams = [bigram for bigram in bigrams if bigram[0] == 'jack']
print(jack_bigrams)
[('jack', 'gisburn'), ('jack', 'gisburn!'), ('jack', 'one'), ('jack', 'himself,'), ('jack', 'himself')]
There are 5 bigrams where "jack" is the first word.
Step 2: Count Occurrences of ('Jack', 'Gisburn')
We count how many times "gisburn" follows "jack" in the corpus.
jack_gisburn_count = sum(1 for bigram in jack_bigrams if bigram[1] == 'gisburn')
jack_gisburn_count
1
The bigram ("jack", "gisburn") appears once.
Step 3: Compute Probability
Finally, we compute the probability:

$$P(\text{gisburn} \mid \text{jack}) = \frac{C(\text{jack}\ \text{gisburn})}{C(\text{jack})}$$
jack_count = len(jack_bigrams)
jack_gisburn_prob = jack_gisburn_count / jack_count
jack_gisburn_prob
0.2
Thus, the probability of "gisburn" appearing after "jack" is 0.2 (or 20%).
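To generalize this computation, here is a minimal sketch (the helper name bigram_prob is hypothetical) that uses the Counter class imported earlier to estimate any bigram probability from the same bigrams list; it reproduces the 0.2 computed above:

bigram_counts = Counter(bigrams)                     # C(w1 w2) for every adjacent pair
history_counts = Counter(b[0] for b in bigrams)      # C(w1), counted as bigram histories

def bigram_prob(w1, w2):
    # Relative frequency C(w1 w2) / C(w1); returns 0.0 if the history never occurs.
    if history_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / history_counts[w1]

bigram_prob("jack", "gisburn")                       # 0.2, matching the manual count above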
Relative Frequency and Maximum Likelihood Estimation (MLE)
The ratio

$$\frac{C(w_{i-1}\,w_i)}{C(w_{i-1})}$$

is also called the relative frequency.
Using relative frequencies to estimate probabilities is a simple yet effective approach for computing maximum likelihood estimation (MLE) in n-gram models. However, directly multiplying probabilities can introduce computational challenges.
The Problem of Underflow and Log Probabilities
Language models can be very large, leading to practical issues—one of which is computing probabilities. Since probabilities are small values between 0 and 1, multiplying many probabilities together results in very small numbers, leading to computational underflow. To mitigate this, we often use log probabilities, which are more numerically stable.
Instead of:

$$P(w_1 w_2 \ldots w_n) = \prod_{i} P(w_i \mid w_{i-1})$$

we compute:

$$\log P(w_1 w_2 \ldots w_n) = \sum_{i} \log P(w_i \mid w_{i-1})$$
By summing log probabilities instead of multiplying them, we avoid numerical instability while preserving the relationships between probabilities.
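As a quick illustration of the problem, here is a minimal sketch using made-up probabilities (not values from our corpus): multiplying many small probabilities underflows to zero, while summing their logs stays comfortably within floating-point range.

import math

probs = [0.01] * 200                        # 200 made-up conditional probabilities

product = 1.0
for p in probs:
    product *= p                            # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)   # a perfectly representable negative number

print(product)    # 0.0
print(log_sum)    # about -921.03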
Evaluating N-gram Models
To assess the performance of an n-gram model, we first split our data into training, development, and test sets. The model is trained on the training set, fine-tuned on the development set, and evaluated on the test set, which measures how well the model generalizes to unseen data.
- Training set: Data used to learn the model parameters.
- Development set (dev set): A separate dataset used to tune hyperparameters and compare different models before final evaluation.
- Test set: A held-out dataset used to assess the final model's performance.
The development set is crucial because it helps prevent overfitting. If we tuned hyperparameters directly on the test set, we might unknowingly optimize the model for that specific dataset rather than ensuring generalization. By using a separate development set, we ensure that the test set remains an unbiased measure of the model's actual performance.
A good language model should generalize well to unseen text. A model that fits the training data perfectly but fails on new data is overfitting, making it ineffective in real-world applications. We measure the quality of an n-gram model by its performance on an unseen test corpus.
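As a minimal sketch (the 80/10/10 proportions are an illustrative assumption, not a requirement), splitting our word list into these three sets might look like this:

n_total = len(words)
train_end = int(0.8 * n_total)    # first 80% of the words for training
dev_end = int(0.9 * n_total)      # next 10% for development

train_words = words[:train_end]
dev_words = words[train_end:dev_end]
test_words = words[dev_end:]

print(len(train_words), len(dev_words), len(test_words))    # 2984 373 374 for this corpus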
Perplexity
To evaluate n-gram models, we compare how well they assign probabilities to a test set. However, using raw probabilities isn’t ideal because:
- The probability of a test set decreases as the text length increases.
- Comparing models based on raw probability is not straightforward.
Instead, we use perplexity, a function of probability that normalizes for text length.
Definition of Perplexity
The perplexity (PP or PPL) of a language model on a test set is:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$$

where:
- $PP(W)$ is the perplexity of the test set $W = w_1 w_2 \ldots w_N$.
- $P(w_1 w_2 \ldots w_N)$ is the probability the model assigns to the test set.
- $N$ is the total number of words in the test set.
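Putting the pieces together, here is a minimal sketch (the helper name perplexity is hypothetical) that computes the perplexity of the bigram model from earlier on a short token sequence, reusing the bigram_prob helper and the math import from the sketches above:

def perplexity(tokens):
    # PP(W) = P(w_1 ... w_N)^(-1/N), computed in log space to avoid underflow.
    # For simplicity, the first word is treated as given (no P(w_1) term), and every
    # bigram in tokens is assumed to have been seen in the corpus (probability > 0).
    log_prob = sum(math.log(bigram_prob(w1, w2)) for w1, w2 in ngrams(tokens, 2))
    return math.exp(-log_prob / len(tokens))

# A short sequence taken from the corpus itself, so every bigram has a non-zero count.
perplexity(["jack", "gisburn", "rather", "a", "cheap", "genius"])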
Interpreting Perplexity
Perplexity quantifies how uncertain or ‘surprised’ a model is when predicting text. A lower perplexity indicates:
- The model assigns higher probabilities to correct word sequences.
- The model is more confident in its predictions.
For example:
- A perfect model that assigns a probability of 1 to the correct test set would have a perplexity of 1.
- A random model that assigns equal probability to all words would have a high perplexity.
- A better-performing model will have a lower perplexity than a weaker one.
Since different language models yield different perplexities for the same text, perplexity is a useful metric for comparing models.
Limitations of N-gram Models
While n-gram models are simple and computationally efficient, they have several limitations:
- Data sparsity: As n increases, the number of possible n-grams grows exponentially, making it difficult to collect enough data to estimate probabilities accurately.
- Fixed context window: N-gram models only consider a limited number of preceding words, ignoring long-range dependencies in language.
- Poor generalization: They rely on exact word matches, meaning they struggle with unseen n-grams or slight variations in phrasing.
- High memory usage: Storing large n-gram counts requires significant memory, especially for higher-order models.
To address these limitations, modern language models use smoothing techniques (like Laplace and Kneser-Ney smoothing) and neural-based approaches like Recurrent Neural Networks (RNNs), LSTMs, and Transformers.
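As an example of the simplest of these fixes, here is a minimal sketch of add-one (Laplace) smoothing for our bigram model (the helper name laplace_bigram_prob is hypothetical, and it reuses the counts built in the earlier sketch):

vocab_size = len(set(words))    # V: number of distinct word types in the corpus

def laplace_bigram_prob(w1, w2):
    # Add-one (Laplace) smoothing: (C(w1 w2) + 1) / (C(w1) + V), so unseen bigrams
    # receive a small non-zero probability instead of zero.
    return (bigram_counts[(w1, w2)] + 1) / (history_counts[w1] + vocab_size)

laplace_bigram_prob("jack", "gisburn")    # slightly below the unsmoothed 0.2
laplace_bigram_prob("jack", "dog")        # small but non-zero, even though this bigram is unseen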
Conclusion
N-gram language models represent a foundational approach to language modeling. They offer a simple and interpretable way to estimate word probabilities but struggle with long-range dependencies and data sparsity. Despite their limitations, they have played a crucial role in NLP and continue to be useful in applications like text compression, speech recognition, and information retrieval.
While deep learning models like Transformers have largely replaced n-gram models for complex NLP tasks, understanding n-grams provides a valuable perspective on the evolution of language modeling techniques. By combining classical approaches with modern advancements, we can build more efficient and effective NLP systems.