Implementing Google's autocomplete feature with n-gram language models - Part 1

Starting today, we are launching a series of articles delving into the implementation of Google's autocomplete feature (we will detail later exactly what it is). More broadly, our aim is to explore n-gram language models, shedding light on how to predict and generate sentence endings from their beginnings. We intend to seamlessly blend theoretical concepts with practical applications, offering insights into the challenges encountered along the way.

What are we aiming to achieve?

In natural language processing (NLP), a common task involves predicting the completion of a sentence based on its initial words. For example, how would a sentence that begins with "My name is" likely conclude? Algorithms capable of performing this task are commonly known as language models.
In this article, we will focus on a simple, straightforward family of language models commonly referred to in the literature as n-gram models.
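To make the idea concrete, here is a minimal sketch of how such a prediction could work using bigram (two-word) counts. The toy corpus, the `predict_next` function, and its behavior are our own illustrative assumptions, not the implementation developed later in the series:

```python
from collections import defaultdict

# Toy corpus; a real autocomplete model would be trained on a large
# collection of text or search queries.
corpus = [
    "my name is john",
    "my name is mary",
    "my dog is cute",
]

# Count how often each word follows a given word (a bigram model).
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def predict_next(word):
    """Return the word most frequently observed after `word`, or None."""
    followers = counts.get(word)
    if not followers:
        return None
    return max(followers, key=followers.get)

print(predict_next("my"))    # -> "name" (seen twice, vs "dog" once)
print(predict_next("name"))  # -> "is"
```

Even this naive frequency-counting approach captures the essence of the method: predict the continuation that occurred most often after the given context in the training data.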

Using n-gram models for Google's autocomplete feature

We will demonstrate that, despite its simplicity, this language model can be effectively applied to everyday tasks, such as powering Google's autocomplete feature: the predictive-text functionality designed to assist users by suggesting possible completions of their search queries as they type.

Important

In reality, Google's current autocomplete feature relies on more advanced algorithms, including sophisticated language models, to analyze patterns from vast search datasets and predict what users might intend to search for. These methods incorporate concepts such as embeddings, recurrent neural networks, and other cutting-edge techniques like transformers. However, the purpose here is to demonstrate that this task can still be accomplished using simple mathematics and minimal code.

We refer to the following book to elucidate certain concepts:

Speech and Language Processing (Jurafsky, Martin)

Without further ado and as usual, let's begin with a few prerequisites to correctly understand the underlying concepts. Continue here.