Utilizing Bloom filters for efficient querying in large-scale datasets - Part 3

In this post, we will briefly explore a concrete example where the techniques mentioned earlier fall short of fully addressing the problem. We will then discuss why a new solution is required.

Processing millions of requests every second

Imagine a simple scenario: a user, let's call him John, is browsing the internet using Google Chrome. To enhance security, Chrome checks whether each URL is potentially malicious with every request. To do this, Google maintains an extensive, constantly updated repository of malicious URLs on a server.

Information

Note that this is a hypothetical scenario. We have no knowledge of whether Google actually maintains such a repository.

The process is simple: John enters a URL in the address bar, and Chrome checks the repository to see whether the URL exists in it. If it does, the request is blocked; otherwise, the page loads as usual.
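To make this flow concrete, here is a minimal sketch of such an exact lookup. The URLs and the in-memory set below are hypothetical stand-ins for the server-side repository described above.

```python
# Hypothetical stand-in for the server-side repository of malicious URLs.
malicious_urls = {
    "http://malicious.example",
    "http://phishing.example/login",
}

def is_blocked(url: str) -> bool:
    # Exact membership test: always correct, but every navigation
    # has to be checked against the full repository.
    return url in malicious_urls

print(is_blocked("http://phishing.example/login"))  # True -> block the request
print(is_blocked("https://example.com"))            # False -> load the page
```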

However, John is not alone: millions of people use Chrome simultaneously. This means that every second, the repository must handle millions of requests. Even if the repository is well indexed or relies on modern hashing techniques, serving that load still demands enormous memory and CPU resources. In this case, even the most efficient data retrieval techniques are insufficient to meet our requirements.

Bloom filters to the rescue

So, what could such a data structure look like, if one exists at all? Ideally, instead of constantly querying the repository, the browser could perform this check locally, directly on the computer where it is installed. Sounds too good to be true?

First of all, this data structure must be memory-efficient.

At first glance, this seems infeasible, because it would require storing the entire repository that Google keeps on its servers locally. Since this repository could contain hundreds of thousands of URLs, keeping a full copy on each user's computer would be prohibitive. So a crucial requirement is that the data structure must be memory-efficient and must not store all elements explicitly.
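As a rough, purely illustrative estimate: a few hundred thousand URLs averaging around 50 bytes each already represent tens of megabytes of raw data, before any indexing overhead, and that footprint grows with every update to the repository.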

Second, and perhaps paradoxically, we do not require the data structure to always give the correct answer.

In fact, we don't necessarily need a data structure that can definitively determine whether a URL is malicious. We can tolerate occasional false positives, where a harmless URL is incorrectly labeled as malicious, as long as the structure returns the correct answer with high probability.

Essentially, we have to design a lightweight structure that may occasionally make mistakes but still offers reliable outcomes most of the time. This data structure exists: enter the Bloom filter.
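Before going further, here is a minimal, self-contained sketch of a Bloom filter: a bit array combined with several hash functions. The bit-array size, the number of hash functions, and the SHA-256 based hashing below are illustrative choices, not a prescription.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: a bit array plus k hash functions."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item: str):
        # Derive k bit positions by hashing the item with k different salts.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means "definitely not present"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))


# Usage: a tiny, hypothetical blocklist check.
blocklist = BloomFilter(num_bits=10_000, num_hashes=4)
blocklist.add("http://malicious.example")

print(blocklist.might_contain("http://malicious.example"))  # True
print(blocklist.might_contain("http://safe.example"))       # almost certainly False
```

Note the asymmetry in might_contain: a negative answer is always correct, while a positive answer may occasionally be a false positive, which is exactly the trade-off we said we could tolerate.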
