Language Modeling - N-gram

In this blog, we'll explore the foundations of n-gram models, how they work, their strengths and limitations, and how they compare to more sophisticated approaches. Understanding these classical techniques provides valuable insights into the evolution of language modeling and why more advanced models were developed.

For example, given the sentence:

The quick brown fox jumps over the lazy ____

We might predict the next word as "dog," "cat," or "river" based on the context, but not function words like "the" or "this."

A language model is a machine learning model that predicts the next word in a sequence and assigns probabilities to each possible word. One of the simplest types of language models is the n-gram model.

An n-gram is a sequence of n words: a bigram is a two-word sequence such as "the quick", and a trigram is a three-word sequence such as "the quick brown".

When we refer to an n-gram model, we mean a model that predicts the nth word given the previous n-1 words. These models provide a simple yet effective way to estimate probabilities in a text sequence and serve as a foundation for more advanced techniques.
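
As a quick, minimal sketch in plain Python (the helper name extract_ngrams is just for illustration), here is how bigrams and trigrams can be read off the example sentence:

sentence = "the quick brown fox jumps over the lazy dog".split()

def extract_ngrams(tokens, n):
    # Slide a window of length n over the token list and collect each window as a tuple.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

extract_ngrams(sentence, 2)[:3]   # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
extract_ngrams(sentence, 3)[:2]   # [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]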

How Do N-gram Models Work?

To understand n-gram models, let's begin with the fundamental task of computing the probability of a word w given some history h:

P(w|h) = \frac{C(h, w)}{C(h)}

where:

C(h, w) is the number of times the history h is followed by the word w in the corpus.

C(h) is the number of times the history h occurs on its own.

For example, given the sentence:

The quick brown fox jumps over the lazy ____

we want to compute the probability of the word "dog" appearing next:

P("dog"|"the quick brown fox jumps over the lazy")=C("the quick brown fox jumps over the lazy dog")C("the quick brown fox jumps over the lazy")

If we had a large enough dataset, we could estimate these probabilities by counting occurrences. However, natural language is highly diverse, and many word combinations might rarely appear in training data, making it difficult to estimate probabilities directly. To address this, we need a more practical way to approximate probabilities.
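
As a toy illustration of why direct counting breaks down (the strings below are made up, not the corpus used later): even when a corpus contains all the individual words, the full history may never occur, so both counts in the ratio can be zero.

corpus = "the lazy dog sleeps while the quick brown fox runs"
history = "the quick brown fox jumps over the lazy"

corpus.count(history)            # 0 -> C(h) = 0, so C(h, w) / C(h) cannot be estimated
corpus.count(history + " dog")   # 0 -> C(h, w) = 0 as well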

The Markov Assumption and N-gram Approximation

This is where n-gram models come in. They simplify the problem by making a key assumption: the probability of a word depends only on the previous n1 words, rather than the entire history. This is known as the Markov assumption:

P(w|h) \approx P(w|h_{n-1}, h_{n-2}, \ldots, h_1)

This assumption drastically reduces the complexity of computing probabilities, making it feasible to estimate them using limited data.

For example, instead of:

P("dog"|"the quick brown fox jumps over the lazy")

we approximate it as:

P("dog"|"lazy")

if we use a bigram model (n=2), meaning we consider only the last word. Similarly, for a trigram model (n=3), we would consider the last two words:

P("dog"|"over the lazy")

Thus, in general, an n-gram model estimates probabilities as:

P(w|h) \approx P(w|h_{n-1}, h_{n-2}, \ldots, h_1)

where:

w is the word being predicted.

h_{n-1}, h_{n-2}, \ldots, h_1 are the n-1 words immediately preceding w.

By using this approximation, n-gram models allow us to compute word probabilities efficiently while capturing local word dependencies.
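
The helper below (a small sketch; markov_context is a hypothetical name) makes the truncation explicit: whatever the full history is, only its last n-1 words are kept as context.

history = "the quick brown fox jumps over the lazy".split()

def markov_context(history_tokens, n):
    # Under the Markov assumption, only the last n-1 words of the history matter.
    return tuple(history_tokens[-(n - 1):]) if n > 1 else ()

markov_context(history, 2)   # ('lazy',)         -> bigram context
markov_context(history, 3)   # ('the', 'lazy')   -> trigram context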

Estimating N-gram Probabilities

Given the Markov assumption, we can estimate n-gram probabilities using maximum likelihood estimation (MLE).

We get MLE estimates by counting occurrences of n-grams in a corpus and normalizing them to obtain probabilities. For example, to estimate bigram probabilities, we count the occurrences of each word pair (h,w) and divide by the total count of history h:

P(w|h) = \frac{C(h, w)}{C(h)}

where:

C(h, w) is the number of times the word w follows the history h in the corpus.

C(h) is the total number of times the history h occurs.

Practical Example: Estimating a Bigram Probability

Let's go through a practical example using the short story The Verdict, available on Wikisource, the free online library.

Loading the Corpus

import re
from collections import Counter
from nltk.util import ngrams

# Read the raw text of the corpus.
with open("the-verdict.txt", "r") as f:
    data = f.read()

# Simple tokenization: replace double dashes with spaces, lowercase, split on whitespace.
words = data.replace("--", " ").lower().split()
len(words)
3731

Our corpus contains a total of 3731 words. Now, let's implement a bigram model.

Extracting Bigrams

We first extract all possible bigrams from the text.

bigrams = list(ngrams(words, 2))
bigrams[:10]
[('i', 'had'),
 ('had', 'always'),
 ('always', 'thought'),
 ('thought', 'jack'),
 ('jack', 'gisburn'),
 ('gisburn', 'rather'),
 ('rather', 'a'),
 ('a', 'cheap'),
 ('cheap', 'genius'),
 ('genius', 'though')]

Now, let's say we want to compute P("gisburn"|"jack"), the probability of the word gisburn appearing after jack.

Step 1: Count All Bigrams Starting with 'Jack'

We extract all bigrams where "jack" is the first word to find possible next words.

jack_bigrams = [bigram for bigram in bigrams if bigram[0] == 'jack']
print(jack_bigrams)
[('jack', 'gisburn'), ('jack', 'gisburn!'), ('jack', 'one'), ('jack', 'himself,'), ('jack', 'himself')]

There are 5 bigrams where "jack" is the first word.

Step 2: Count Occurrences of ('Jack', 'Gisburn')

We count how many times "gisburn" follows "jack" in the corpus.

jack_gisburn_count = sum(1 for bigram in jack_bigrams if bigram[1] == 'gisburn')
jack_gisburn_count
1

The bigram ("jack", "gisburn") appears once.

Step 3: Compute Probability

Finally, we compute the probability:

P("gisburn"|"jack")=C("jack", "gisburn")C("jack")
jack_count = len(jack_bigrams)
jack_gisburn_prob = jack_gisburn_count / jack_count
jack_gisburn_prob
0.2

Thus, the probability of "gisburn" appearing after "jack" is 0.2 (or 20%).
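
Continuing the same example, we can look at the full next-word distribution after "jack" (reusing the jack_bigrams list from above); by construction, these MLE probabilities sum to 1.

from collections import Counter

next_word_counts = Counter(w for _, w in jack_bigrams)
total = sum(next_word_counts.values())
{w: c / total for w, c in next_word_counts.items()}
# {'gisburn': 0.2, 'gisburn!': 0.2, 'one': 0.2, 'himself,': 0.2, 'himself': 0.2}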

Relative Frequency and Maximum Likelihood Estimation (MLE)

The ratio

C("jack", "gisburn")C("jack")

is also called the relative frequency.
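
The same relative-frequency idea scales to the whole corpus. As a rough sketch (reusing the bigrams list and the Counter class imported earlier), we can compute MLE estimates for every observed bigram at once:

from collections import Counter

history_counts = Counter(h for h, _ in bigrams)   # C(h): how often each word starts a bigram
bigram_counts = Counter(bigrams)                  # C(h, w): how often each word pair occurs

bigram_probs = {(h, w): c / history_counts[h] for (h, w), c in bigram_counts.items()}
bigram_probs[('jack', 'gisburn')]   # 0.2, matching the hand computation above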

Using relative frequencies to estimate probabilities is a simple yet effective approach for computing maximum likelihood estimation (MLE) in n-gram models. However, directly multiplying probabilities can introduce computational challenges.

The Problem of Underflow and Log Probabilities

Language models can be very large, leading to practical issues—one of which is computing probabilities. Since probabilities are small values between 0 and 1, multiplying many probabilities together results in very small numbers, leading to computational underflow. To mitigate this, we often use log probabilities, which are more numerically stable.

Instead of:

p_1 \times p_2 \times p_3 \times p_4 \times p_5

we compute:

e^{\log(p_1) + \log(p_2) + \log(p_3) + \log(p_4) + \log(p_5)}

By summing log probabilities instead of multiplying them, we avoid numerical instability while preserving the relationships between probabilities.
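
A small numeric sketch of both the problem and the fix (the probability values are made up for illustration):

import math

probs = [1e-5] * 100      # one hundred small word probabilities

product = 1.0
for p in probs:
    product *= p
product                            # 0.0 -- the true value (1e-500) underflows double precision

sum(math.log(p) for p in probs)    # about -1151.3 -- the log-probability stays representable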

Evaluating N-gram Models

To assess the performance of an n-gram model, we first split our data into training, development, and test sets. The model is trained on the training set, fine-tuned on the development set, and evaluated on the test set, which measures how well the model generalizes to unseen data.

Training set: Data used to learn the model parameters.

Development set (Dev set): A separate dataset used to tune hyperparameters and compare different models before final evaluation.

Test set: A held-out dataset used to assess the final model’s performance.

The development set is crucial because it helps prevent overfitting. If we tuned hyperparameters directly on the test set, we might unknowingly optimize the model for that specific dataset rather than ensuring generalization. By using a separate development set, we ensure that the test set remains an unbiased measure of the model's actual performance.

A good language model should generalize well to unseen text. A model that fits the training data perfectly but fails on new data is overfitting, making it ineffective in real-world applications. We measure the quality of an n-gram model by its performance on an unseen test corpus.
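
As a minimal sketch of such a split on the word list from earlier (the 80/10/10 ratio is an arbitrary choice for illustration):

n_total = len(words)
train_words = words[: int(0.8 * n_total)]
dev_words = words[int(0.8 * n_total): int(0.9 * n_total)]
test_words = words[int(0.9 * n_total):]

len(train_words), len(dev_words), len(test_words)   # (2984, 373, 374)

In practice the split is usually made over documents or sentences rather than at an arbitrary point mid-text, but the idea is the same.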

Perplexity

To evaluate n-gram models, we compare how well they assign probabilities to a test set. However, using raw probabilities isn’t ideal because:

The probability of a test set shrinks as the text gets longer, simply because more probabilities are multiplied together.

This makes raw probabilities hard to compare across test sets of different lengths.

Instead, we use perplexity, a function of probability that normalizes for text length.

Definition of Perplexity

The perplexity (PP or PPL) of a language model on a test set is:

PP(W) = P(w_1, w_2, w_3, \ldots, w_N)^{-\frac{1}{N}}

where:

W = w_1, w_2, \ldots, w_N is the test set.

N is the number of words in the test set.

P(w_1, w_2, \ldots, w_N) is the probability the model assigns to the entire test set.

Interpreting Perplexity

Perplexity quantifies how uncertain or ‘surprised’ a model is when predicting text. A lower perplexity indicates:

The model assigns a higher probability to the test set.

The model is less surprised by unseen text, i.e. it predicts the next words with more confidence.

For example, a perplexity of 50 means the model is, on average, as uncertain as if it had to choose uniformly among 50 equally likely words at each step. A model that achieves perplexity 20 on the same text is therefore the better model.

Since different language models yield different perplexities for the same text, perplexity is a useful metric for comparing models.
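
As a sketch of how this could be computed for the bigram model above, here is the perplexity of the MLE bigram model measured on its own training corpus (a real evaluation would use a held-out test set, with smoothing so that unseen bigrams do not receive zero probability):

import math
from collections import Counter

history_counts = Counter(h for h, _ in bigrams)
bigram_counts = Counter(bigrams)

# Sum of log-probabilities of every bigram in the corpus under the MLE estimates,
# then apply PP(W) = exp(-(1/N) * log P(w_1, ..., w_N)).
log_prob = sum(math.log(bigram_counts[(h, w)] / history_counts[h]) for h, w in bigrams)
N = len(bigrams)
math.exp(-log_prob / N)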

Limitations of N-gram Models

While n-gram models are simple and computationally efficient, they have several limitations:

Data Sparsity: As n increases, the number of possible n-grams grows exponentially, making it difficult to collect enough data to estimate probabilities accurately.

Fixed Context Window: N-gram models only consider a limited number of preceding words, ignoring long-range dependencies in language.

Poor Generalization: They rely on exact word matches, meaning they struggle with unseen n-grams or slight variations in phrasing.

High Memory Usage: Storing large n-gram counts requires significant memory, especially for higher-order models.

To address these limitations, modern language models use smoothing techniques (like Laplace and Kneser-Ney smoothing) and neural-based approaches like Recurrent Neural Networks (RNNs), LSTMs, and Transformers.
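
As a taste of the simplest of these fixes, here is a rough sketch of Laplace (add-one) smoothing for the bigram model above (laplace_bigram_prob is a hypothetical helper, reusing words and bigrams from earlier):

from collections import Counter

vocab_size = len(set(words))                      # V: number of distinct word types
history_counts = Counter(h for h, _ in bigrams)
bigram_counts = Counter(bigrams)

def laplace_bigram_prob(h, w):
    # Add-one smoothing: (C(h, w) + 1) / (C(h) + V), so unseen bigrams get a small nonzero probability.
    return (bigram_counts[(h, w)] + 1) / (history_counts[h] + vocab_size)

laplace_bigram_prob("jack", "gisburn")   # much smaller than the MLE value of 0.2, since mass is spread over the vocabulary
laplace_bigram_prob("jack", "river")     # small but nonzero, whereas the MLE estimate would be 0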

Conclusion

N-gram language models represent a foundational approach to language modeling. They offer a simple and interpretable way to estimate word probabilities but struggle with long-range dependencies and data sparsity. Despite their limitations, they have played a crucial role in NLP and continue to be useful in applications like text compression, speech recognition, and information retrieval.

While deep learning models like Transformers have largely replaced n-gram models for complex NLP tasks, understanding n-grams provides a valuable perspective on the evolution of language modeling techniques. By combining classical approaches with modern advancements, we can build more efficient and effective NLP systems.