Understanding vector databases - Part 2
In this post, we will explore the concept of vector databases and elucidate why they are highly suited and even essential in certain scenarios.
In the beginning was data
Until recently, data was predominantly stored in traditional relational databases, and queries of varying complexity were executed to retrieve information.
In the late 2000s, the rise of cloud computing coupled with significant reductions in storage expenses led to an influx of unstructured and diverse data. This included images, videos, audio files, documents, and text of varying natures. Concurrently, the demand to query these diverse data objects persisted, albeit with increased complexity.How can we address this challenge ?
In everyday life, what does the concept of similarity entail ?
When we remark, for instance, that our sister bears a resemblance to our friend's daughter, we engage in human-based reasoning, employing similarity criteria such as eye color or other physical attributes. Similarly, we might assert that two documents resemble each other due to their utilization of identical words or synonymous expressions. In this scenario, we're analyzing documents and employing words as features to conduct a comparison.
The entities under consideration here, whether human beings or documents, aren't easily quantifiable in the traditional sense, as we're not employing mathematical reasoning to determine a precise degree of similarity. Rather, our brains are naturally adept at intuitively discerning commonalities between these entities, often without the need for explicit calculation.So, how can we introduce a degree of formality ?
Mathematics to the rescue
Often, mathematics serves as the key to solving a portion of our problems. Here, we're grappling with a few formal concepts.
What are vector spaces ?
A vector space is a set equipped with two operations: vector addition and scalar multiplication. These operations must satisfy certain properties, including closure under addition and scalar multiplication, associativity, commutativity, the existence of an additive identity (zero vector), and the existence of additive inverses for every vector. Vector spaces provide a framework for studying vectors, which can represent a variety of mathematical objects such as geometric quantities, physical quantities, and abstract mathematical entities.
Mathematicians have crafted the concept of a vector space to possess a structural stability under linear combination, allowing for the addition of two vectors or their scaling by a constant factor. This foundational concept is pivotal in linear algebra, granting us the ability to manipulate abstract quantities—adding them, subtracting them, and comparing them—in a fundamental manner.
The set of real numbers serves as an example of a vector space: when we add two real numbers, the result remains a real number.
Does mathematics define the notion of similarity ?
In practice, it commonly defines the opposite notion, which is distance, but it's equally possible to define similarity within mathematical frameworks.
Consider a vector space $E$. A metric or distance function on $E$ is a function $d$ which takes pairs of points of $E$ to real numbers and satisfies the following rules:
- The distance between an object and itself is always zero: $d(x,x)=0 \text{ for all } x \in E$
- The distance between distinct objects is always positive.
- Distance is symmetric: $d(x,y)=d(y,x) \text{ for all } x,y \in E$
- Distance satisfies the triangle inequality: $d(x,y)\le d(x,z) + d(z,y) \text{ for all } x,y,z \in E$
Distances can theoretically be defined in more general spaces, known as metric spaces, where elements aren't required to adhere to the additive properties of vector spaces. However, the properties of metric spaces are generally less rich compared to those of vector spaces.
For instance, in the set of real numbers $\R$, the absolute value functions as a metric, allowing us to assess the proximity between two values.
$$d(x,y)=\vert x-y \vert$$
While these examples may seem straightforward, they serve to illustrate the essence of the topic: mathematics has long formalized the concept of proximity, and we should leverage these principles to facilitate comparisons between documents or images.
In practice, we will operate within the vector space $\R^n$, which represents the set of $n$-tuples of real numbers.
Therefore, we are presented with tangible entities such as images, audio files, or documents that we aim to compare, alongside a robust mathematical framework that models similarities.
Yet, how can we bridge the divide between these two realms ?
At this juncture, we encounter a challenge: we possess a set $A$ of unstructured data comprising images, videos, and documents that aren't easily comparable, alongside a robust mathematical framework. The solution to bridging this gap once again arises from mathematics, specifically the concept of a function. A function establishes a mapping between two sets of elements, providing a means to relate the disparate data sets and enable meaningful comparisons.
$$f:A \to \R^n$$
In artificial intelligence, this function is referred to as an embedding. Essentially, we translate, or "embed", each element of set $A$ into an $n$-tuple. However, it's important to note that the reverse operation cannot be universally applied: not every $n$-tuple necessarily represents an element of set $A$ (since the set of real numbers $\R$ is uncountable).
In theory, the concept seems straightforward: we have a function that facilitates a proper mapping between documents or images and a vector space. However, the practical implementation of this function is where the challenge lies. This process involves modeling the function $f$. As a result, in the literature, $f$ is often referred to as the "audio model" (if we aim to embed audio data), or the "video model" (if we aim to embed video data), and so forth, depending on the type of data being embedded. These models are designed to capture meaningful representations of the respective data types in a vector space, enabling comparisons, clustering, classification, and various downstream tasks in machine learning and artificial intelligence.
Numerous models have been conceptualized and deployed based on specific requirements and the nature of the data. To illustrate this concept concretely, we will delve into one such model: tf-idf (as detailed in the subsequent post).
And so, what are vector databases ?
Once data such as images or documents have been transformed into embeddings within a vector space, they can be stored in a datastore. While traditional relational database management systems (RDBMS) could serve this purpose, they are not specifically designed for storing vectors and are therefore not optimized for such data. This necessity has led to the development of vector databases.
Vector databases are specialized databases designed to efficiently store and query high-dimensional vector data. Unlike traditional relational databases that are optimized for tabular data, vector databases are tailored to handle vectors as primary data types.
They provide mechanisms for storing, indexing, and querying vectors in a manner that preserves their high-dimensional structure and enables efficient similarity search and other vector-based operations. Examples of vector databases include Milvus, Faiss, and Pinecone.
What are the practical uses of vector databases ?
Once a document or an image has been stored, it becomes possible to execute queries against these vectors. Conducting similarity searches with vectors is indeed a straightforward process, as demonstrated earlier.
Vector databases facilitate the indexing and searching of image or video features, enabling users to search for images or videos with similar visual characteristics. For instance, a user could seek images depicting a specific object or scene, and the system would retrieve visually similar images.
Vector databases can drive product recommendation engines, offering personalized suggestions derived from a user's historical purchases or browsing habits. Leveraging the indexing of product vectors and employing similarity search functionalities, the system can swiftly pinpoint products that closely match a user's preferences.
Vector databases can be used for fraud detection.
...
In the subsequent post, we will delve into a specific model utilized for documents.