Understanding vector databases - Part 4

In this concluding post, we will delve into several algorithms aimed at efficiently computing similarities.

Vector databases are structured to enable rapid similarity computations. However, when seeking the k-nearest neighbors of a given vector, computing the distance between every vector (brute force approach) becomes impractical, especially when dealing with billions of vectors. To address this challenge, various techniques have been developed to perform efficient calculations. These techniques are commonly known as approximate nearest neighbors (ANN).

Here, we will simply outline the primary existing algorithms and refer to more specialized resources for further details.

Discovering ANNOY

ANNOY (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings that provides an efficient implementation of approximate nearest neighbor search algorithms. ANNOY is designed to quickly find approximate nearest neighbors for high-dimensional data sets. It achieves this by constructing binary trees that partition the data space and then efficiently traversing these trees to identify the nearest neighbors.


Discovering LSH

LSH (Locality Sensitive Hashing) is a technique used for approximate nearest neighbor search in high-dimensional spaces. LSH works by hashing similar data points into the same buckets with high probability, thus enabling efficient approximate similarity search. The key idea behind LSH is to design hash functions that map similar data points to the same or nearby hash buckets, while mapping dissimilar points to different buckets with high probability.


Discovering HNSW

HNSW (Hierarchical Navigable Small World) is a method for constructing data structures to facilitate efficient approximate nearest neighbor search in high-dimensional spaces. HNSW builds a hierarchical graph structure where each node represents a data point and edges connect nodes to their nearest neighbors. The graph is constructed in such a way that it exhibits small-world properties, meaning that nodes are connected to both their nearest neighbors and other nodes that are further away but still relevant.


Final thoughts

If you wish to delve deeper into this topic, acquire the following book, which encompasses all the concepts emphasized in this series and delve into more advanced ones.

Vector Databases for Natural Language Processing: Building Intelligent NLP Applications with High-Dimensional Text Representations (Allen)

Do not hesitate to contact me shoud you require further information.