Utilizing Bloom filters for efficient querying in large-scale datasets - Part 2

In this post, we will explore various techniques for storing and querying massive amounts of data, with a particular emphasis on hashing.

What are some techniques for storing data?

Storing data is as old as civilization itself, and practitioners have faced this challenge since the early days of computing. Storing massive amounts of data is even more complex, requiring different techniques depending on the specific requirements and available resources. Here, we briefly outline some techniques developed over the past decades; for an in-depth discussion, we refer interested readers to specialized textbooks.

Searching methods generally fall into two main strategies: utilizing binary trees or employing hashing techniques.

Using binary trees

A binary search tree (BST) is a hierarchical data structure used for efficient searching, insertion, and deletion of elements. It follows these key properties:

  • Each node has at most two children: a left child and a right child (binary structure).
  • For any given node:
      • The left subtree contains values smaller than the node’s value.
      • The right subtree contains values greater than the node’s value.

This second property is crucial because it enables highly efficient searching. By consistently directing the search either left or right based on comparisons, we can quickly locate elements, reducing the average search time to $O(\log n)$ in a balanced tree: starting from the root, we compare the target value with the current node's value, move left if the target is smaller and right if it is larger, and continue until the value is found or the search reaches a null reference.

BSTs offer additional advantages: they provide excellent performance for finding the minimum or maximum values, as well as for performing range queries efficiently. This efficiency makes them a fundamental data structure in computer science.
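The search procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: real-world BSTs also need rebalancing to guarantee the $O(\log n)$ bound, which is omitted here.

```python
class Node:
    """A binary search tree node."""
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    """Insert a value while preserving the BST ordering property."""
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    elif value > root.value:
        root.right = insert(root.right, value)
    return root

def contains(root, value):
    """Walk left or right at each node until the value is found
    or the search reaches a null reference."""
    node = root
    while node is not None:
        if value == node.value:
            return True
        node = node.left if value < node.value else node.right
    return False
```

Each comparison discards one of the two subtrees, which is exactly why a balanced tree answers membership queries in logarithmic time.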

Can we achieve even better performance? As is often the case, there is no such thing as a free lunch. While there are data structures that offer exceptional search efficiency, they often come at the cost of other important operations, such as finding the minimum or maximum values or performing range queries. Hashing is one such technique: we will see that it enables searches to be performed in constant time, provided we are willing to sacrifice efficiency in other operations.

Using hashing

In certain applications, our sole concern is determining whether an element belongs to a given set, without needing to find the minimum or maximum. In such cases, a binary search tree may be excessive, consuming more space and time than required.

Hashing is a technique used to store and retrieve data efficiently by mapping elements to fixed-size values called hash codes, using a hash function. These hash codes are then used as indices in a data structure called a hash table, allowing for near-instantaneous lookup times in the average case.
This definition is quite abstract and uses specialized terms (such as hash functions and hash codes), so let's clarify and provide more details.

Leveraging arrays

Arrays are highly efficient structures because each element can be accessed directly using its index. Each such access can be performed in constant time ($O(1)$).

Hashing techniques take advantage of arrays and their constant-time access: the idea is to map each piece of data to a specific index in an array of items.

Imagine we have a set of thousands of strings that we need to query efficiently. Hashing involves using a special function that maps each string to an integer within a specific range. For example, the string "element_to_store_1" might be mapped to the index 2. The corresponding position in the array at index 2 will then store a reference to this string (as illustrated in the diagram).
By applying this method to all strings in the set, we create an array of references that enables highly efficient querying. When retrieving a specific string, we first use the special function to map it to an integer, then simply access the element stored at the corresponding position in the array.
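The scheme above can be sketched in Python. The table size and the use of SHA-256 as the "special function" are illustrative choices, not prescriptions; collisions are deliberately ignored in this sketch, since they are discussed next.

```python
import hashlib

TABLE_SIZE = 11  # size of the reference array (an arbitrary choice)

def bucket_index(s):
    """Map a string to an index in [0, TABLE_SIZE) using a
    deterministic hash (SHA-256 truncated to 8 bytes)."""
    digest = hashlib.sha256(s.encode()).digest()
    return int.from_bytes(digest[:8], "big") % TABLE_SIZE

# The array of references; collisions are ignored in this sketch.
table = [None] * TABLE_SIZE
table[bucket_index("element_to_store_1")] = "element_to_store_1"

def lookup(s):
    """Hash the query, then inspect the single matching slot: O(1)."""
    return table[bucket_index(s)] == s
```

Note that a lookup performs no scanning at all: one hash computation, one array access, one comparison, regardless of how many strings are stored.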

Information

The average retrieval time is $O(1 + \frac{n}{m})$, where $n$ represents the number of elements in the initial set and $m$ denotes the size of the pointer array. The proof of this result can be found in Algorithms.

Important

The special function is called a hash function, and the underlying array of references is known as a hash table.

Wait a minute! What happens if the hash function maps two different elements to the same position?

Yes, that’s a great observation! It is entirely possible for a hash function to map two different elements to the same position in the hash table, as shown in our example above (element_to_store_1 and element_to_store_123780 both map to 2). This phenomenon is called a collision, and there are various techniques to handle it.

We won’t go into too much detail, but in our case, we’ve illustrated a separate chaining resolution: when multiple elements are hashed to the same slot, they are stored in a singly linked list, known as a chain.
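Separate chaining can be sketched as follows. For simplicity this sketch uses Python lists as the chains rather than hand-rolled singly linked lists, and Python's built-in hash(); the class name and table size are illustrative.

```python
class ChainedHashTable:
    """Hash table with separate chaining: each slot holds the
    chain of all elements that hashed to that slot."""

    def __init__(self, size=11):
        self.slots = [[] for _ in range(size)]

    def _index(self, key):
        """Map a key to one of the slots."""
        return hash(key) % len(self.slots)

    def add(self, key):
        """Append the key to its slot's chain (no duplicates)."""
        chain = self.slots[self._index(key)]
        if key not in chain:
            chain.append(key)

    def __contains__(self, key):
        """Membership test: hash once, then scan only one chain."""
        return key in self.slots[self._index(key)]
```

A lookup only ever scans the chain of a single slot, which is how the average cost stays proportional to $n/m$ rather than $n$.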

Key takeaway

Hashing is incredibly powerful for quickly retrieving an element from a set or efficiently checking its existence—it even outperforms a BST in these tasks. However, if we need to perform operations like finding the minimum or maximum of a set, hashing becomes completely unsuitable. Unlike BSTs, which maintain a sorted structure, hash tables store elements in a way that provides fast lookups but lacks any inherent order.

If you need to quickly find the maximum or minimum key, find keys in a given range (...), then hashing is not appropriate, since these operations will all take linear time.
Algorithms

And what if we have so many elements that they no longer fit in memory?

The binary trees and hashing techniques we've discussed so far are highly effective when the underlying set can be stored in memory. However, in some applications, such as large databases, the data may be too vast to fit into memory, requiring us to turn to more advanced techniques to address these challenges.

The counterpart to binary search trees (BSTs) when memory is limited is the B-tree.

B-trees are self-balancing search trees designed for efficient storage and retrieval of large amounts of data, especially on disk-based storage systems.

  • Nodes can have multiple children: unlike BSTs (where each node has at most two children), a B-tree node can have multiple keys and children, making it well-suited for storage systems that work with large blocks of data.

  • Since B-trees minimize the number of reads and writes required for operations, they are widely used in databases and file systems to optimize disk I/O operations (efficient disk access).
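To make the multi-key, multi-child structure concrete, here is a minimal sketch of B-tree search in Python. It covers lookup only; insertion with node splitting and rebalancing is omitted, and the node layout is a simplified assumption rather than a faithful on-disk format.

```python
class BTreeNode:
    """One node of a B-tree: a sorted list of keys, plus one
    child subtree for each gap between (and around) the keys."""

    def __init__(self, keys, children=None):
        self.keys = keys                # sorted list of keys
        self.children = children or []  # empty list for a leaf

def btree_search(node, key):
    """Find the position of the first key >= target, then either
    report a match or descend into the corresponding child.
    Each node visited corresponds to one block read on disk."""
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == key:
        return True
    if not node.children:  # leaf: nowhere left to look
        return False
    return btree_search(node.children[i], key)
```

Because each node packs many keys into one disk block, the tree is very shallow, so a search touches only a handful of blocks even for huge datasets.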

Important

We won’t delve further into the details of B-Trees. The key takeaway is that there are data structures equivalent to BSTs that are specifically designed to retrieve elements efficiently from datasets so large that they cannot fit in memory.

And what is the counterpart to hashing when memory is limited ?

It is entirely possible to build hash tables that store elements directly on disk. While these structures work correctly, every retrieval requires a disk access, which remains a costly operation. In the remainder of this article, we will develop a partial alternative to hashing for the case where the underlying set cannot fit into memory.

But before diving into the details, we will first take some time to explore why the various data structures we have discussed, despite their efficiency, may not always be suitable for certain situations.
