Implementing vector databases in C# with kd-trees - Part 4
In this post, we will explore how to efficiently store multidimensional data by introducing the kd-tree data structure. Our goal is to effectively handle multidimensional searches.
What are binary search trees ?
In computer science, when dealing with search operations, practitioners often turn to tree structures. While we won't delve deeply into this topic here, we will draw inspiration from these data structures to implement kd-trees.
Trees are extensively covered in the following authoritative textbooks.
Introduction to Algorithms (Cormen, Leiserson, Rivest, Stein)
Binary search trees (BSTs) are a type of binary tree data structure that has the following properties:
- Each node in a binary search tree can have at most two children, referred to as the left child and the right child.
- All nodes in the left subtree have values less than the node's value.
- All nodes in the right subtree have values greater than the node's value.
This property ensures that the BST is ordered such that for any node, all nodes in its left subtree have values less than the node's value, and all nodes in its right subtree have values greater than the node's value.
Here's a brief example of a binary search tree.
1 5
2 / \
3 3 8
4 / \ \
5 2 4 10
- The root node has a value of 5.
- The left subtree of the root (with root 3) has nodes with values less than 5 (i.e., 3, 2, 4).
- The right subtree of the root (with root 8) has nodes with values greater than 5 (i.e., 8, 10).
BSTs excel in search operations, which makes them highly suitable for tasks like database indexing. It is thus natural to draw inspiration from them for our needs. But, in contrast to binary search trees which handle unidimensional data by comparing values at each node and choosing a direction, kd-trees are designed to manage multidimensional data within a tree structure. However, this requires a few fundamental changes to the approach.
What are kd-trees ?
A kd-tree is a space-partitioning data structure used to organize points in a $k$-dimensional space.
A kd-tree is a binary tree where each node represents a point in $k$-dimensional space.
At each level of the tree, a different dimension is used to split the space. The root node splits based on the first dimension, the next level splits based on the second dimension, and so on, cycling through the dimensions.
Each node in a kd-tree contains a $k$-dimensional point (referred to as an embedding in our previous discussions) and references to its left and right child nodes.
Here's an example of a kd-tree structure for 2-dimensional data points ($k=2$).
1level 0 (split on first dimension):
2 (5, 3)
3 / \
4level 1 (split on second dimension):
5 (2, 4) (9, 6)
6 / \ / \
7level 2 (split on first dimension):
8(3, 1) (1, 7) (8, 5) (10, 8)
- The root node splits based on the first dimension.
- The second level splits based on the second dimension.
- The third level splits based on the first dimension again, and so on.
In practical use cases, data points are split across more dimensions, but representing them visually becomes challenging. Therefore, illustrations of kd-trees often use $k=2$ for simplicity.
Kd-trees were introduced in 1975 by Jon Bentley (the original paper can be found here). While the definition we propose here differs slightly from the original in terminology, it ultimately achieves the same goal and performs efficient searches.
A kd-tree is constructed by recursively splitting the data points based on median values of the dimensions. This ensures that the tree is balanced, which optimizes search performance.
Our definition stipulates that a kd-tree is fundamentally a binary tree. As a consequence, embeddings are inserted in a specific order: the value corresponding to the split dimension is compared to the current node's value along the same axis. If the value is greater, we recursively insert the embedding in the right node; otherwise, it is inserted in the left node. This process continues until a leaf is reached.
For example, how would we insert $(4, 11)$ into the preceding tree ? Here are the detailed steps to perform this operation.
The root node's first dimension is $5$, and the inserted embedding's first dimension is $4$, so we traverse down to the left ($4 < 5$).
We now compare with the node $(2, 4)$. Since we are at the second level, we compare using the second dimension. The inserted embedding's second dimension is $11$, so we traverse down to the right ($11 > 4$).
We now compare with the node $(1, 7)$. Since we are at the third level, we compare again using the first dimension. The inserted embedding's first dimension is $4$, so we traverse down to the right ($1 < 4$).
1level 0 (split on first dimension):
2 (5, 3)
3 / \
4level 1 (split on second dimension):
5 (2, 4) (9, 6)
6 / \ / \
7level 2 (split on first dimension):
8(3, 1) (1, 7) (8, 5) (10, 8)
9 \
10 (4, 11)
Armed with this theoretical foundation, we will now explore how to conduct efficient searches.