Utilizing Bloom filters for efficient querying in large-scale datasets - Part 6
In this post, we will implement a Bloom filter in C# and demonstrate its functionality through a concrete application: we will show how it can be used to quickly check if a URL is malicious.
How to design k different hash functions ?
One of the challenges in designing a Bloom filter is ensuring that the hash functions are relatively independent. Achieving this is not a straightforward task. That's why we will refer to this article (Building a better Bloom filter), which describes a technique for generating multiple hash functions from a single hash function code.
We assume that we have a hash function that returns 128 bits. We then split this into two 64-bit integer values (long in C#). Let's call these values h1 and h2.
1long hash = h1;
2for (var i = 0; i < k; i++)
3{
4 hash += (k*h2);
5 //
6}
This process yields the necessary number of hash functions without the overhead of calling the hash calculation multiples times.
Coding the Bloom filter
For the purposes of our program, we will generate 10,000 URLs and use BenchmarkDotNet to conduct a simple comparison with the Contains method of the List class.
Defining a BloomFilter class
This class simulates the definition we provided for a Bloom filter. It utilizes the code above to generate hash functions.
1public class BloomFilter
2{
3 private readonly byte[] _bits;
4 private SHA256 _sha256;
5 private int _numberOfHashFunctions;
6
7 public long NumberOfBits { get; private set; }
8 public long NumberOfElements { get; private set; }
9
10 public BloomFilter(long numberOfBits, long numberOfElements)
11 {
12 NumberOfBits = numberOfBits;
13 NumberOfElements = numberOfElements;
14
15 _bits = new byte[NumberOfBits];
16 _sha256 = SHA256.Create();
17 _numberOfHashFunctions = Convert.ToInt32(Math.Ceiling((double)NumberOfBits * Math.Log(2) /(double)NumberOfElements));
18 }
19
20 public void Add(string item)
21 {
22 var hashes = HashString(item);
23 foreach (var hash in hashes)
24 {
25 _bits[hash] = 1;
26 }
27 }
28
29 public bool Contains(string item)
30 {
31 var hashes = HashString(item);
32 foreach (var hash in hashes)
33 {
34 if (_bits[hash] == 0) return false;
35 }
36
37 return true;
38 }
39
40 #region Private Methods
41
42 private long[] HashString(string pattern)
43 {
44 var result = new long[_numberOfHashFunctions];
45
46 var data = Encoding.UTF8.GetBytes(pattern);
47 var hashBytes = _sha256.ComputeHash(data);
48
49 var h1 = BitConverter.ToInt64(hashBytes, 0);
50 var h2 = BitConverter.ToInt64(hashBytes, 16);
51
52 var hash = h1;
53 for (var i = 0; i<_numberOfHashFunctions; i++)
54 {
55 hash += (i * h2);
56 result[i] = Math.Abs(hash % NumberOfBits);
57 }
58
59 return result;
60 }
61
62 #endregion
63}
Defining the Program.cs class
1internal class Program
2{
3 static void Main(string[] args)
4 {
5 var summary = BenchmarkRunner.Run(typeof(Program).Assembly);
6 }
7}
8
9public class BloomFilterBenchmark
10{
11 private static int N = 10000;
12 private static int M = 16 * N;
13
14 private BloomFilter _filter = new BloomFilter(M, N);
15 private List<string> _urls = new List<string>();
16
17 public BloomFilterBenchmark()
18 {
19 for (var i = 0; i<N; i++)
20 {
21 var guid = $"https://www.{Guid.NewGuid()}.com";
22 _filter.Add(guid);
23 _urls.Add(guid);
24 }
25 }
26
27 private readonly string url = "https://www.google.com";
28
29 [Benchmark]
30 public bool ContainsNative() => _urls.Contains(url);
31
32 [Benchmark]
33 public bool ContainsBloomFilter() => _filter.Contains(url);
34}
This code is straightforward: we generate 10,000 URLs and verify that an unknown URL correctly returns false. To do so, we use two methods: the native Contains method of the List class and our implementation of a Bloom filter.
Running the program
As expected, we can observe that a Bloom filter is highly efficient and outperforms other methods by several orders of magnitude.
Utilizing Bloom filters for efficient querying in large-scale datasets - Part 7