Utilizing Bloom filters for efficient querying in large-scale datasets - Part 5

In this post, we will formalize what we previously discussed and provide a precise definition of a Bloom filter. We will then delve into the mathematical principles behind it, focusing on the optimization of its parameters.

Building on the previous post, we need to provide a formal computational translation of what we referred to as tags or characteristics, and to design an efficient, lightweight data structure around it.

What is a Bloom filter?

Essentially, a Bloom filter consists of a bit array (a large array of $m$ bits) and $k$ independent hash functions (functions that take an element and map it to a bit index in the array).

Important

The bit array serves as the mathematical formalization of the "summary" of the underlying datastore.

The hash functions serve as the mathematical formalization of the tagging we mentioned earlier: for each element, each hashing function generates a hash that acts as the signature of the element, essentially serving as a tag or categorization.

This definition might seem a bit abstract, so let's take a closer look at how it works in detail.

An element is added.

Let's revisit our example from Part 3 about malicious URLs in Google. Suppose we need to add the URL "www.hacking.com" to our Bloom filter.

  • Initially, our Bloom filter consists of a bit array where all bits are set to 0. This ensures that no elements are marked as present in the filter at the start.

  • Our URL, "www.hacking.com", is processed through each of the $k$ hash functions. Each hash function generates an index corresponding to a position in the bit array.

  • The bits at all these indexed positions in the bit array are then set to 1, marking the presence of the URL in the Bloom filter.

The illustration below demonstrates this operation with $m=10$ and $k=3$.
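As a quick preview of the implementation to come (Part 6 will give a full C# version), the insertion steps above can be sketched in a few lines of Python. The array size, the URL, and the trick of simulating $k$ independent hash functions by salting a single SHA-256 hash are illustrative choices, not part of the formal definition:

```python
import hashlib

M = 10  # number of bits in the array (matches the m = 10 illustration)
K = 3   # number of hash functions (matches k = 3)

def indices(element: str) -> list[int]:
    # Simulate K independent hash functions by salting one SHA-256 hash
    # (a common practical approximation, not the only possible choice).
    return [int(hashlib.sha256(f"{i}:{element}".encode()).hexdigest(), 16) % M
            for i in range(K)]

bits = [0] * M  # step 1: all bits start at 0

def add(element: str) -> None:
    # steps 2 and 3: hash the element K times, set the resulting bits to 1
    for i in indices(element):
        bits[i] = 1

add("www.hacking.com")
print(bits)  # at most K positions are now set to 1
```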

Suppose now we need to add the URL "www.hacking2.com".

In this example, we observe that the first hash function produces the same index for both URLs (a collision, as we saw in the previous posts). In the next section, we will explore how these collisions contribute to false positives.

An element is looked up.

Let's now explore how to check whether an element is in the set. Suppose we need to verify that the URL "www.iamahacker.com" is in our filter.

  • The URL is processed through the $k$ hash functions.

  • The bits at the $k$ resulting indices are examined.

  • If ANY of these bits is 0, it indicates that this URL has never been encountered before (otherwise, the bit would have been set to 1 as explained before), meaning the URL is definitely not in the set.

  • If ALL of these bits are set to 1, it suggests that the element is probably in the set.
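The lookup steps above amount to an all-bits-set check. Here is a Python sketch (again with illustrative parameters; the salted-SHA-256 trick stands in for $k$ independent hash functions):

```python
import hashlib

M, K = 10, 3  # illustrative bit-array size and hash-function count

def indices(element: str) -> list[int]:
    # salt a single SHA-256 hash to simulate K independent hash functions
    return [int(hashlib.sha256(f"{i}:{element}".encode()).hexdigest(), 16) % M
            for i in range(K)]

bits = [0] * M

def add(element: str) -> None:
    for i in indices(element):
        bits[i] = 1

def might_contain(element: str) -> bool:
    # ALL bits set -> "probably in the set"; ANY bit 0 -> "definitely not"
    return all(bits[i] == 1 for i in indices(element))

add("www.hacking.com")
print(might_contain("www.hacking.com"))  # True: never a false negative
```

Note that an element that was actually inserted always comes back as "probably present", since the very same bits it set are the ones examined: a Bloom filter never produces a false negative.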

Suppose now we need to verify that the URL "www.hacking2.com" is in our filter.

As expected, the filter suggests that the element is probably present in the set.

Hold on! Why do we say "probably"?

Imagine we're checking whether the URL "www.hacking10.com" is in the set. We know it isn't, but what will the Bloom filter return?

What happens here is that all the hash functions return indices that have already been set to 1. These indices were not set to 1 by a single URL but by several different ones (for example, bit 2 was set to 1 by www.hacking2.com, and bit 8 by www.hacking.com). As a result, the filter reports that the URL is probably in the set. However, this is a false positive caused by unavoidable collisions: different URLs can hash to the same indices.
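This phenomenon is easy to observe empirically. The Python sketch below (sizes chosen arbitrarily for illustration) inserts a batch of URLs into a deliberately small filter, then probes it with URLs that were never inserted; a small fraction of the probes come back as "probably present":

```python
import hashlib

M, K, N = 1000, 3, 200  # illustrative bit-array size, hash count, insertions

def indices(element: str) -> list[int]:
    # salt one SHA-256 hash to simulate K independent hash functions
    return [int(hashlib.sha256(f"{i}:{element}".encode()).hexdigest(), 16) % M
            for i in range(K)]

bits = [0] * M
for url in (f"www.hacking{i}.com" for i in range(N)):
    for i in indices(url):
        bits[i] = 1

# Probe URLs that were never inserted and count "probably present" answers.
# Theory (developed below) predicts roughly (1 - e^(-KN/M))^K ~ 0.09 here.
probes = [f"www.other{i}.com" for i in range(2000)]
false_positives = sum(all(bits[i] for i in indices(url)) for url in probes)
print(f"observed false-positive rate: {false_positives / len(probes):.3f}")
```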

Very important

A Bloom filter is essentially a bit array associated with a large set of elements, serving as a clever summary of that set. This lightweight structure is designed to fit into memory: because each position costs a single bit, it packs a large amount of information into a very compact form while still enabling fast membership checks. When we need to check whether an element belongs to the underlying set, we first consult the Bloom filter and obtain a quick answer.

Optimizing a Bloom filter

Now that we have a better understanding of how a Bloom filter works, a natural question arises: can we analyze it mathematically? Is it possible to estimate the probability of false positives? And can we adjust the parameters (such as the number of hash functions and the size of the bit array) to minimize this probability?

Information

For a thorough understanding of mathematical algorithm analysis and its application through simple examples, please refer to this article: Understanding time complexity on simple examples.

What are the parameters we have available?

Two key parameters are at our disposal: the number $m$ of bits in the bit array and the number $k$ of hash functions. The goal is to determine whether an optimal combination of these parameters exists to create an extremely efficient data structure.
We will also need another variable (although we have less control over it): the total number $n$ of elements to insert. Naturally, we assume that $n$ is so large that storing all these elements in memory is not feasible.

Estimating the probability of a false positive

We will denote by $q_{11}$ the probability that a given bit in the bit array is set to 1 after the insertion of the first element using the first hash function. As there are $m$ bits in the array and the hash function is assumed to pick each index uniformly, $q_{11}=\dfrac{1}{m}$.

Consequently, if we denote by $p_{11}$ the probability that a given bit is still 0 after this operation, we obtain $p_{11}=1-\dfrac{1}{m}$. Finally, if we denote by $p_{1k}$ the probability that a bit remains 0 after applying all $k$ hash functions, we obtain $p_{1k}=\left(1-\dfrac{1}{m}\right)^k$. This follows from the crucial assumption that the hash functions are independent.

Now, let $p_{nk}$ be the probability that a bit remains 0 after inserting $n$ elements, each with $k$ hash functions. We then have $p_{nk}=\left(1-\dfrac{1}{m}\right)^{nk}$. Using the first-order Taylor expansion $\ln(1-x) \approx -x$ for small $x$, we obtain the following approximation.

$$p_{nk}=\left(1-\dfrac{1}{m}\right)^{nk} \approx e^{-\frac{kn}{m}}$$
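A quick numerical check (with arbitrarily chosen values of $m$, $n$, and $k$) shows how tight this approximation is:

```python
import math

m, n, k = 10_000, 1_000, 7  # illustrative parameter values

exact = (1 - 1 / m) ** (n * k)   # p_nk = (1 - 1/m)^(nk)
approx = math.exp(-k * n / m)    # e^(-kn/m)
print(exact, approx)  # the two values agree to several decimal places
```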

From this point onward, we can estimate the probability $f_{nk}$ of a false positive. A false positive occurs when, for an element that was never inserted, all $k$ examined bits are already set to 1; each bit is 1 with probability $1-p_{nk}$. Again, the hash functions are assumed to be independent, and so we obtain the following estimate.

$$\boxed{f_{nk}=(1-p_{nk})^k=e^{k\ln(1-p_{nk})}}$$

Minimizing this probability

Our goal now is to minimize this probability. How should we choose the parameters to achieve this?
For simplicity, we will assume that both $m$ and $n$ are fixed, meaning we know how many items we need to process and have chosen the number of bits used to summarize them; the optimization is then over $k$ alone.

We have $f_{nk}=e^{k\ln(1-p_{nk})}=e^{-\frac{m}{n} \ln(p_{nk}) \ln(1-p_{nk})}$ (remember that $p_{nk}= e^{-\frac{kn}{m}}$, and so $k=-\frac{m}{n}\ln(p_{nk})$).

Minimizing $f_{nk}$ is therefore equivalent to minimizing the exponent $p \longmapsto -\frac{m}{n} \ln(p) \ln(1-p)$. This function is symmetric under swapping $p$ and $1-p$, and a short derivative computation confirms that the minimum is attained at $p_{nk}=\dfrac{1}{2}$.

Thus, $\boxed{k_{min}=\frac{m}{n}\ln 2}$, and we have the following important result.

$$f_{nk_{min}}=\left(\dfrac{1}{2}\right)^{k_{min}} \approx (0.6185)^{\frac{m}{n}}$$

By letting $c=\dfrac{m}{n}$, we finally have $f_{nk_{min}} \approx (0.6185)^c$.

A few examples

If $c=8$, then $k_{min} \approx 6$ and in this case

$$f_{nk_{min}} \approx 0.0214$$

If $c=16$, then $k_{min} \approx 11$ and in this case

$$f_{nk_{min}} \approx 0.0005$$

In this case, it means that out of 2,000 requests, only one is expected to be a false positive on average. This demonstrates the efficiency of the Bloom filter in minimizing incorrect results while maintaining a lightweight structure.
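The two examples above can be reproduced directly from the formulas $k_{min}=\frac{m}{n}\ln 2$ and $f_{nk_{min}} \approx (0.5)^{k_{min}}$ (a small Python check; the values of $c$ are the ones from the examples):

```python
import math

def optimal_k(c: float) -> float:
    # k_min = (m/n) * ln 2, with c = m/n
    return c * math.log(2)

def min_false_positive_rate(c: float) -> float:
    # f = (1/2)^(k_min) = (0.6185...)^c at the optimal k
    return 0.5 ** optimal_k(c)

for c in (8, 16):
    print(c, optimal_k(c), min_false_positive_rate(c))
```

For $c=8$ this yields a rate of roughly 0.0214, and for $c=16$ roughly 0.0005, matching the figures above (in practice $k$ must of course be rounded to an integer).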

Important

Hold on a second! The number of bits required for a Bloom filter is significantly larger than the number of elements in the initial set ($c \ge 1$). So, what is the benefit of using this data structure?

That's correct, but a bit array consists of bits, and bits are very inexpensive to store in memory. For example, with $n=1000000$ and $c=16$, we have $m=16000000$, and 16000000 bits $\approx$ 1.9 MB. An array of that size easily fits into the memory of modern computers.
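The memory arithmetic is a one-line check (sizes taken from the example above):

```python
n, c = 1_000_000, 16     # elements and bits-per-element from the example
m_bits = n * c           # total bits in the array
size_mib = m_bits / 8 / 1024 ** 2  # bits -> bytes -> MiB
print(f"{m_bits} bits = {size_mib:.2f} MiB")  # about 1.9 MiB
```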

But enough with the theory! Let's now look at how to concretely implement a Bloom filter in C#.
