Utilizing Bloom filters for efficient querying in large-scale datasets - Part 4
In this post, we will provide an informal introduction to Bloom filters using a simple analogy with a stamp collection.
The analogy presented here is inspired by a post from the StackUp blog.
What we describe below is merely an analogy and should be taken as such. It serves to illustrate the concept and clarify ideas, rather than precisely depicting the inner workings of Bloom filters. For a more detailed explanation, refer to the next post.
Imagine that I am a stamp collector
Imagine that I own a vast stamp collection, inherited from my grandparents and parents, and expanded through my own travels around the world. As a result, I now possess hundreds of thousands of stamps, meticulously compiled in numerous albums. However, I can no longer remember precisely which stamps I own. So, if my wife decides to surprise me with a stamp for my birthday, I have no way of immediately knowing whether I already have it in my collection.
Of course, in this case, I could store all these stamps in a dedicated data structure, such as a simple index similar to those found at the end of a book. This would save me from flipping through every album page, but it would also require a significant amount of storage space.
That’s why, to streamline the process, I will select a small set of key characteristics such as color, country of origin, and year, and tag each stamp accordingly (as shown below).
How these categories are stored and queried is not important for our example (though it would be in practice). The goal here is to provide an intuitive, informal understanding of how Bloom filters work.
I specifically mentioned that I have a small set of characteristics. Why? Because this approach minimizes memory usage: only a few bits per stamp will be required.
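To make this concrete, here is a toy sketch in Python of the tagging scheme described above. Everything in it (the tag vocabulary, the stamps) is invented purely for the analogy: each possible tag gets one shared bit, and adding a stamp simply switches on the bits for its tags.

```python
# Toy "tag filter" for the analogy: one shared bit per possible tag.
# The tag vocabulary and the stamps are made up for illustration.
TAGS = ["red", "blue", "green", "USSR", "USA", "France", "1965", "1970", "1975"]

bits = [False] * len(TAGS)  # a handful of bits, regardless of collection size

def add_stamp(characteristics):
    """Mark each tag of a stamp as 'seen somewhere in the collection'."""
    for c in characteristics:
        bits[TAGS.index(c)] = True

# Tag a couple of stamps from the albums.
add_stamp(["red", "USSR", "1965"])    # a small red Soviet stamp from 1965
add_stamp(["blue", "France", "1970"])
```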
How can I then quickly check if a specific stamp is already in my collection?
How can I quickly determine if the stamp my wife wants to give me is already in my collection? The process is quite simple: I compare the characteristics of the new stamp with those of the existing ones. From this comparison, two mutually exclusive outcomes are possible.
Case 1: I may find an existing stamp that matches ALL the specified characteristics.
Since a stamp in my collection was already tagged with "red", "USSR" and "1965", it appears that the new stamp might already be there. However, in reality, it isn't (the existing stamp is smaller). This situation occurs because the new stamp coincidentally shares the same characteristics as another, leading to a false positive.
We can't distinguish between the two stamps because we only track a limited set of characteristics, not the full details of each stamp.
Image source: StackUp
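Continuing the toy sketch from above, a membership check simply tests whether every tag of the candidate stamp is already marked. The red 1965 Soviet stamp my wife picked carries exactly the same tags as one I already own, so the check comes back positive even though the stamps differ:

```python
def might_contain(characteristics):
    """True means 'possibly in the collection'; False means 'definitely not'."""
    return all(bits[TAGS.index(c)] for c in characteristics)

# A DIFFERENT red Soviet stamp from 1965 (larger than mine) shares all
# three tags with a stamp I already own: a false positive.
print(might_contain(["red", "USSR", "1965"]))  # True, yet I don't own it
```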
Why not add an additional tag for the stamp's size? This would help improve differentiation and reduce false positives. Yes, that's a valid question. However, the more tags we add, the heavier the filter becomes, making it impractical to store in memory. We will revisit this point later in this post.
Case 2: At least ONE of the characteristics is not marked.
If even one of the characteristics is not tagged, we can be certain that the stamp is not in the collection.
Since I have no stamps in my collection from 1975, I know for sure that this stamp is not part of my collection.
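In the toy sketch, this is the case where at least one bit is still off, which makes the negative answer definitive:

```python
# No stamp in the collection was ever tagged "1975", so that bit is
# still off and the answer is a guaranteed negative.
print(might_contain(["red", "USSR", "1975"]))  # False: definitely not owned
```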
Okay, but why is this better than using a simple B-tree, for instance?
That's a great question. One might wonder what advantages we gain from using this new data structure. After all, we still have to maintain it and add new tags each time we acquire a new stamp. What's innovative about this compared to the structures we discussed in previous posts?
It turns out that the data structures we've used so far must store ALL the characteristics of a stamp: not only its origin, year, or color, but also its size, what it represents, the type of paper it's made from, and so on. As a result, these data structures cannot fit entirely in memory: checking whether an element exists requires disk accesses, and those operations are known to be time-consuming. A Bloom filter, on the other hand, ONLY stores a subset of characteristics and can fit entirely in RAM. It therefore serves as a sort of summary (proxy) of the entire collection, allowing for efficient querying.
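To hint at how a real Bloom filter builds this in-memory summary with hash functions instead of hand-picked tags, here is a minimal sketch in Python. The sizes m (bits) and k (hash functions per item) are arbitrary placeholders, and the string encoding of a stamp is invented for illustration:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter sketch: m bits, k hash-derived positions per item."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions from k salted SHA-256 digests of the item.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # A False here is definitive; a True only means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("red/USSR/1965")
print(bf.might_contain("red/USSR/1965"))   # True: actually added
print(bf.might_contain("blue/France/1975"))  # almost surely False
```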
Obviously, there is no such thing as a free lunch: since I only have a kind of summary, there are times when I might not get accurate results. These incorrect results are called false positives (the Bloom filter indicates that it contains an element, even though the underlying data structure does not). The challenge, then, is to design a filter with enough characteristics to be useful, but not too many, so that it fits in memory.
Imagine, for example, that I only tag my stamps with the country of origin (see above). There would be around 300 possible tags (roughly the number of countries in the world, with some disappearing and others emerging). With millions of stamps in my collection, it is almost certain that I own stamps from every country, so every tag would end up marked. As a result, my filter would be extremely lightweight but ultimately ineffective: it would answer "maybe" for every query.
Conversely, if I use too many tags, my filter will become highly discriminative but will be too memory-intensive.
This is the core philosophy behind a Bloom filter: to have enough discriminative tags to be useful, while avoiding an excess that would make it too memory-heavy.
Bloom filters are highly efficient data structures for set membership queries in situations where memory is constrained and a small probability of error (false positives) is acceptable. They leverage the power of hashing and the tradeoff between space and accuracy to provide fast, scalable solutions.
We will formalize this intuition in the next post.
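As a small preview of that formalization, the standard sizing formulas give, for n items and a target false-positive rate p, the number of bits m = -n ln(p) / (ln 2)^2 and the number of hash functions k = (m/n) ln 2:

```python
import math

def bloom_parameters(n, p):
    """Standard Bloom filter sizing for n items and false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits
    k = max(1, round((m / n) * math.log(2)))              # hash functions
    return m, k

# Half a million stamps, 1% false positives:
print(bloom_parameters(500_000, 0.01))  # ~4.8 million bits (~0.6 MB), 7 hashes
```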