I'd like to store an n-dimensional feature vector, e.g. <1.00, 0.34, 0.22, ..., 0>, with each document, and then provide another feature vector as a query, with the results sorted in order of cosine similarity. Is this possible with Elasticsearch?
Elasticsearch's own implementation of vector search
Elasticsearch uses Apache Lucene internally as its search engine, so many (if not all) of the low-level concepts, data structures and algorithms apply equally to Solr and Elasticsearch.
Semantic search, a form of search commonly used in search engines, serves content to users by understanding the intent and meaning behind the search query. It is a step beyond traditional text and keyword matching.
The dense_vector field type stores dense vectors of float values. Dense vector fields can be used in the following ways:
- in script_score queries, to score documents matching a filter;
- in the kNN search API, to find the k most similar vectors to a query vector.
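For completeness, here is a minimal sketch of what this looks like with the official Python client against Elasticsearch 8.x. The index name docs and field name feature are made up for illustration, and the exact parameters can vary between versions.

```python
# Sketch: a dense_vector mapping plus a kNN search (Elasticsearch 8.x,
# official Python client). Index and field names are illustrative only.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Map a 4-dimensional dense_vector field, indexed for cosine similarity.
es.indices.create(
    index="docs",
    mappings={
        "properties": {
            "feature": {
                "type": "dense_vector",
                "dims": 4,
                "index": True,
                "similarity": "cosine",
            }
        }
    },
)

es.index(index="docs", document={"feature": [1.00, 0.34, 0.22, 0.0]})
es.indices.refresh(index="docs")

# Retrieve the k documents whose vectors are most similar to the query.
resp = es.search(
    index="docs",
    knn={
        "field": "feature",
        "query_vector": [0.90, 0.30, 0.20, 0.10],
        "k": 10,
        "num_candidates": 100,
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["feature"])
```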
Elasticsearch runs Lucene under the hood, so for ordinary text queries it historically used Lucene's Practical Scoring Function by default. This is a similarity model based on term frequency (tf) and inverse document frequency (idf) that also uses the vector space model (VSM) for multi-term queries. (Recent versions default to BM25 instead.)
I don't have an answer specific to Elasticsearch because I've never used it (I use Lucene, on which Elasticsearch is built). However, I'll give a generic answer to your question. There are two standard ways to obtain the nearest vectors given a query vector, described as follows.
K-d tree
The first approach is to store the vectors in memory with the help of a data structure that supports nearest-neighbour queries, e.g. k-d trees. A k-d tree is a generalization of the binary search tree in the sense that every level of the tree partitions the data along one of the k dimensions. If you have enough space to load all the points in memory, you can apply the nearest-neighbour search algorithm on a k-d tree to obtain a list of retrieved vectors sorted by cosine similarity; note that if all vectors are normalized to unit length, ranking by Euclidean distance is equivalent to ranking by cosine similarity, so a standard k-d tree search suffices. The obvious disadvantage of this method is that it does not scale to the huge sets of points often encountered in information retrieval.
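As a concrete illustration, here is a small sketch using SciPy's KDTree. The key observation is that for unit-length vectors ||a - b||^2 = 2 - 2*cos(a, b), so a Euclidean nearest-neighbour query over normalized data returns exactly the top results by cosine similarity. The data below is random and purely illustrative.

```python
# Sketch: cosine-similarity nearest neighbours via a k-d tree.
# For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so Euclidean ranking
# on normalized data equals cosine-similarity ranking.
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(42)
docs = rng.random((10_000, 8))                       # 10k 8-dim vectors
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # normalize to unit length

tree = KDTree(docs)

query = rng.random(8)
query /= np.linalg.norm(query)

# k nearest by Euclidean distance == top-k by cosine similarity.
dist, idx = tree.query(query, k=5)
cosine = 1.0 - dist**2 / 2.0                         # recover cosine values
for i, c in zip(idx, cosine):
    print(i, round(float(c), 4))
```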
Inverted Quantized Vectors
The second approach is to use inverted quantized vectors. A simple range-based quantization assigns pseudo-terms or labels to the real numbers of a vector so that they can later be indexed by Lucene (or, for that matter, Elasticsearch).
For example, we may assign the label A to the range [0, 0.1), B to the range [0.1, 0.2), and so on, up to J for [0.9, 1]. The sample vector in your question is then encoded as (J, D, C, ..., A), because 1.00 falls in [0.9, 1] (J), 0.34 falls in [0.3, 0.4) (D), 0.22 falls in [0.2, 0.3) (C), and 0 falls in [0, 0.1) (A).
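A minimal sketch of this encoding step in Python, using the ten-bucket A-J scheme above (the bucket width and alphabet are of course tunable):

```python
# Sketch: range-based quantization mapping each vector component to a
# pseudo-term (A..J for the ranges [0,0.1), [0.1,0.2), ..., [0.9,1]).
def quantize(vector):
    labels = "ABCDEFGHIJ"
    # min(..., 9) keeps the boundary value 1.0 inside the last bucket, J.
    # A possible refinement (not shown): prefix each label with its
    # dimension index, e.g. "0_J 1_D", so values from different
    # dimensions do not collide in the index.
    return " ".join(labels[min(int(x * 10), 9)] for x in vector)

doc = quantize([1.00, 0.34, 0.22, 0.0])
print(doc)                                 # -> "J D C A"

# A query vector is quantized the same way before searching, and the
# resulting string can be indexed as an ordinary text field.
print(quantize([0.95, 0.31, 0.05, 0.0]))   # -> "J D A A"
```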
A vector of real numbers is thus transformed into a string of pseudo-terms, which can be treated as a document and indexed with a standard information retrieval (IR) tool. A query vector is transformed into a bag of pseudo-terms in the same way, which makes it possible to retrieve the vectors in the collection most similar (in terms of cosine similarity or another measure) to the query.
The main advantage of this method is that it scales well to massive collections of real-valued vectors. The key disadvantage is that the computed similarity values are mere approximations of the true cosine similarities (due to the loss incurred by quantization). Smaller quantization ranges approximate the true similarities more closely, at the cost of a larger index.