The primary functionality of the vector store is straightforward: identifying the most similar vectors to a given vector. While the concept is simple, translating it into a practical product poses significant challenges.
The most basic approach to searching a vector database is an exhaustive search: comparing the query vector to every stored vector, one by one. However, this consumes too many resources and results in very high latencies, making it impractical at scale. To address this problem, Approximate Nearest Neighbor (ANN) algorithms are used. ANN search approximates the true nearest neighbors, which means it might not find the absolute closest points, but it will find ones that are close enough, with low latency and fewer resources.
In the literature, the fraction of the exhaustive-search results that an ANN search recovers is called the recall rate. The higher the recall rate, the better the results.
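To make the trade-off concrete, here is a minimal, self-contained sketch (not Upstash Vector code) that runs an exhaustive search and computes the recall rate of a hypothetical approximate result set against the exact answer:

```python
import numpy as np

def exhaustive_search(query: np.ndarray, vectors: np.ndarray, k: int) -> set[int]:
    """Exact k-nearest neighbors: compare the query against every stored vector."""
    distances = np.linalg.norm(vectors - query, axis=1)  # O(n * d) work per query
    return set(np.argsort(distances)[:k].tolist())

def recall(approx_ids: set[int], exact_ids: set[int]) -> float:
    """Fraction of the true nearest neighbors that the approximate result recovered."""
    return len(approx_ids & exact_ids) / len(exact_ids)

rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 128))  # 10k stored vectors, 128 dimensions
query = rng.normal(size=128)

exact = exhaustive_search(query, vectors, k=10)
# Pretend an ANN index returned the top-10 but missed one true neighbor:
approx = set(list(exact)[:9]) | {int(max(exact)) + 1}
print(f"recall@10 = {recall(approx, exact):.2f}")  # 0.90
```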
Several ANN algorithms, such as HNSW [1], NSG [2], and DiskANN [3], are available, each with its own characteristics. One of the difficult problems with ANN algorithms is that building and querying the index may require keeping the whole dataset in memory. When the dataset is huge, the memory required for indexing may exceed what is available. The DiskANN algorithm tries to solve this problem by using the disk as the main storage for the index and serving queries directly from disk. The DiskANN paper acknowledges that if you simply store your vectors on disk and use HNSW or NSG, you may again end up with very high latencies. DiskANN focuses on serving queries from disk with low latency and a good recall rate.
This helps Upstash Vector be cost-effective, and therefore cheaper than alternatives.
Even though DiskANN has its advantages, it requires more work to be practical.
The main problem is that you can't insert into or update an existing index without reindexing all the vectors.
For this problem, there is an improved follow-up paper, FreshDiskANN [4]. FreshDiskANN improves DiskANN by introducing a temporary in-memory index for recent data. Queries are served both from this temporary (up-to-date) index and from the disk, and the temporary index is merged into the on-disk index periodically, behind the scenes.
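The serving pattern described above can be illustrated with a simplified, hypothetical sketch (not the actual Upstash Vector or FreshDiskANN implementation): new vectors land in a small in-memory tier, queries fan out to both tiers, and the in-memory tier is periodically merged into the main index:

```python
import heapq
import numpy as np

class TwoTierIndex:
    """Toy illustration of the FreshDiskANN idea: a small in-memory index for
    fresh writes plus a large (conceptually on-disk) index, queried together."""

    def __init__(self):
        self.memory_tier: dict[str, np.ndarray] = {}  # recent upserts
        self.disk_tier: dict[str, np.ndarray] = {}    # stands in for the on-disk graph index

    def upsert(self, vector_id: str, vector: np.ndarray) -> None:
        # Writes always land in the in-memory tier first.
        self.memory_tier[vector_id] = vector

    def query(self, query: np.ndarray, k: int) -> list[tuple[float, str]]:
        # Fan out to both tiers and merge the candidates by distance.
        candidates = []
        for tier in (self.memory_tier, self.disk_tier):
            for vector_id, vector in tier.items():
                distance = float(np.linalg.norm(vector - query))
                candidates.append((distance, vector_id))
        return heapq.nsmallest(k, candidates)

    def merge_to_disk(self) -> None:
        # Periodically (in the background), fold fresh vectors into the main index.
        self.disk_tier.update(self.memory_tier)
        self.memory_tier.clear()
```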
Upstash Vector is based on DiskANN and FreshDiskANN, with further improvements derived from our own tests and observations.
When creating a vector index in Upstash Vector, you have the flexibility to choose from different vector similarity functions. Each function yields distinct query results, catering to specific use cases. Here are the three supported similarity functions:
The score returned from query requests is a normalized value between 0 and 1, where 1 indicates the highest similarity and 0 the lowest, regardless of the similarity function used.
Cosine similarity measures the cosine of the angle between two vectors. It is particularly useful when the magnitude of the vectors is not essential, and the focus is on the orientation.
Use Cases:
Score calculation:
(1 + cosine_similarity(v1, v2)) / 2
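As an illustration (not the service's internal code), this normalization maps the cosine similarity from [-1, 1] to [0, 1]:

```python
import numpy as np

def cosine_score(v1: np.ndarray, v2: np.ndarray) -> float:
    """Map cosine similarity from [-1, 1] to a score in [0, 1]."""
    cosine_similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float((1 + cosine_similarity) / 2)

print(cosine_score(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.5 for orthogonal vectors
```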
Euclidean distance calculates the straight-line distance between two vectors in a multi-dimensional space. It is well-suited for scenarios where the magnitude of vectors is crucial, providing a measure of their spatial separation.
Use Cases:
Score calculation:
1 / (1 + squared_distance(v1, v2))
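An illustrative sketch of this scoring (again, not the internal implementation); note that the squared distance is used, so larger separations push the score toward 0:

```python
import numpy as np

def euclidean_score(v1: np.ndarray, v2: np.ndarray) -> float:
    """Map squared Euclidean distance from [0, inf) to a score in (0, 1]."""
    squared_distance = float(np.sum((v1 - v2) ** 2))
    return 1 / (1 + squared_distance)

print(euclidean_score(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 1 / (1 + 25) ≈ 0.038
```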
The dot product measures similarity by multiplying the corresponding components of two vectors and summing the results. It provides a measure of alignment between vectors. Note that to use the dot product, the vectors need to be normalized to unit length.
Use Cases:
Score calculation:
(1 + dot_product(v1, v2)) / 2
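A small illustrative sketch of this normalization (not the service's internal code); here the vectors are normalized to unit length inside the helper, which the service expects you to do before upserting:

```python
import numpy as np

def dot_product_score(v1: np.ndarray, v2: np.ndarray) -> float:
    """Map the dot product of unit-length vectors from [-1, 1] to a score in [0, 1]."""
    v1 = v1 / np.linalg.norm(v1)  # vectors must be normalized to unit length
    v2 = v2 / np.linalg.norm(v2)
    return float((1 + np.dot(v1, v2)) / 2)

print(dot_product_score(np.array([1.0, 1.0]), np.array([2.0, 2.0])))  # 1.0 for identical directions
```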
The metadata feature allows you to store additional context alongside your vectors. There are a couple of uses for this:
You can set metadata with your vector as follows:
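Here is a minimal sketch using the Python SDK (upstash-vector); the exact method and parameter names are assumptions on our part, so check the SDK reference for the current API:

```python
from upstash_vector import Index

# Assumed client setup; replace with your own REST URL and token.
index = Index(url="UPSTASH_VECTOR_REST_URL", token="UPSTASH_VECTOR_REST_TOKEN")

# Upsert a vector together with its metadata (assumed (id, vector, metadata) tuple form).
index.upsert(
    vectors=[
        ("id-0", [0.9215, 0.3897], {"url": "https://example.com/some-document"}),
    ]
)
```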
When you do a query or fetch, you can opt-in to retrieve the metadata as follows:
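A sketch of a query that opts in to metadata, again with assumed parameter and field names rather than a definitive reference:

```python
from upstash_vector import Index

index = Index(url="UPSTASH_VECTOR_REST_URL", token="UPSTASH_VECTOR_REST_TOKEN")

# Opt in to metadata in the response (assumed include_metadata flag).
result = index.query(
    vector=[0.9215, 0.3897],
    top_k=5,
    include_metadata=True,
)

for match in result:
    print(match.id, match.score, match.metadata)  # assumed result fields
```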