Implementing Google's autocomplete feature with n-gram language models - Part 6
In this final post, we will briefly examine the strengths and weaknesses of n-gram models and explore why it may be necessary to refine them for improved predictions.
What happens if there are no examples in the corpus?
Imagine that someone, perhaps a bit confused, enters a query on Google and types "Wet is thr bzst" instead of "What is the best". With a bigram model, we would attempt to find the most probable words following "bzst". However, since "bzst" never appears in our corpus, no word ever follows it in the training data, so every candidate word $b$ receives a zero probability ($P(b|bzst)=0$). By the chain rule of probability, a single zero factor drives the probability of the entire sequence to zero.
$$P(Wet,is,thr,bzst,b)\approx P(Wet)P(is|Wet)P(thr|is)P(bzst|thr)\underbrace{P(b|bzst)}_{\text{=0}}=0$$
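To make this concrete, here is a minimal sketch using a hypothetical toy corpus and a plain maximum-likelihood bigram estimate (not the exact code from the earlier posts): a single unseen bigram is enough to zero out the whole chain-rule product.

```python
from collections import Counter

# Hypothetical toy corpus; any real corpus would behave the same way.
corpus = "what is the best what is the weather what is the time".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """Maximum-likelihood estimate P(word | prev_word); zero if the bigram was never seen."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

def sentence_prob(words):
    """Chain-rule approximation of the sequence probability with bigrams."""
    prob = unigram_counts[words[0]] / len(corpus)
    for prev_word, word in zip(words, words[1:]):
        prob *= bigram_prob(prev_word, word)
    return prob

print(sentence_prob("what is the best".split()))      # non-zero
print(sentence_prob("wet is thr bzst best".split()))  # 0.0: one unseen bigram zeroes everything
```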
The issue becomes even more pronounced with trigrams or 4-grams: the longer the context, the more likely it is that a particular word sequence never appears in the corpus. This leads to even more cases where probabilities are assigned as zero, further limiting the model's effectiveness.
And we should not assume that this problem is limited to confused users. It’s also possible that a new word has emerged—such as a new trend, a product with an innovative name, or a newly coined term—and we don’t have any examples of it in our corpus. In such cases, the same issue arises: the probability for that word would be assigned as zero.
Matching genres and dialects is still not sufficient. Our models may still be subject to the problem of sparsity. For any n-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. That is, we’ll have many cases of putative "zero probability n-grams" that should really have some non-zero probability.
Speech and Language Processing
What is the solution?
There are likely several solutions to this problem, but the most well-known and widely used approach is smoothing.
Laplace smoothing merely adds one to each count (hence its alternate name add-one smoothing). Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations.
Speech and Language Processing
There are other examples of smoothing techniques, but to keep things simple, we will focus solely on Laplace smoothing.
Let’s recall how we assign a probability to each next word $b$ in the corpus. We estimate it from relative frequency counts: take a very large corpus, count the number of times $c(w_{i})$ we see the word $w_i$, and count the number of times $c(w_{i}b)$ it is followed by $b$. $$P(b|w_{i}) = \dfrac{c(w_{i}b)}{c(w_{i})}$$
With Laplace smoothing, the new probability will be calculated as follows.
$$P_{Laplace}(b|w_{i}) =\dfrac{c(w_{i}b)+1}{\sum_{w} (c(w_{i}w)+1)}=\dfrac{c(w_{i}b)+1}{c(w_{i})+V}$$
Note that the denominator must be adjusted to take into account the extra V observations: summing the incremented counts over the whole vocabulary gives $c(w_{i})+V$. This redistributes a little probability mass, but it ensures that we avoid zero probabilities. Even if a count is zero, the added one guarantees that the result is strictly positive.
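Continuing the same hypothetical toy corpus as above, here is a hedged sketch of what the Laplace-smoothed estimate could look like in code; the function name and counters are my own illustrative choices, not the implementation from the earlier posts.

```python
from collections import Counter

# Hypothetical toy corpus; in practice this would be the corpus built in the earlier posts.
corpus = "what is the best what is the weather what is the time".split()
vocabulary = set(corpus)
V = len(vocabulary)

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def laplace_bigram_prob(prev_word, word):
    """Add-one (Laplace) estimate: (c(prev_word word) + 1) / (c(prev_word) + V)."""
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + V)

print(laplace_bigram_prob("the", "best"))   # seen bigram
print(laplace_bigram_prob("bzst", "best"))  # unseen context, still > 0
```

With an unseen context such as "bzst", every word gets the same small probability $1/V$, which is exactly the fallback behaviour we wanted instead of zero.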
It can take a long time to obtain results
As discussed in previous posts, we need to assign a probability to each word in the vocabulary and identify the most likely next word every time a new word is typed. This process can be resource-intensive, especially when dealing with a large vocabulary.
This is why the n-gram language model is most effective when the vocabulary size is manageable.
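As an illustration of the cost, here is a small sketch (reusing the same hypothetical toy counts and Laplace estimate as above) that scores every word in the vocabulary for each prediction. With a realistic vocabulary of tens of thousands of words, this linear scan is repeated on every keystroke.

```python
import heapq
from collections import Counter

# Same hypothetical toy corpus and Laplace-smoothed bigram estimate as in the sketch above.
corpus = "what is the best what is the weather what is the time".split()
vocabulary = set(corpus)
V = len(vocabulary)
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def laplace_bigram_prob(prev_word, word):
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + V)

def top_k_next_words(prev_word, k=3):
    """Score every word in the vocabulary, then keep the k most probable ones.
    The scoring step is O(V) per prediction, which is why vocabulary size matters."""
    return heapq.nlargest(k, vocabulary, key=lambda w: laplace_bigram_prob(prev_word, w))

print(top_k_next_words("the"))
```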
Final thoughts
If you wish to delve deeper into this topic, consider getting the following book, which covers all the concepts highlighted in this series and delves into more advanced ones (notably recurrent neural networks and transformers).
Speech and Language Processing
Do not hesitate to contact me should you require further information.