I have a large collection of texts, where each text is rapidly growing. I need to implement a similarity search.
The idea is to embed each word with word2vec and represent each text as a normalized vector obtained by summing the embeddings of its words. Subsequent additions to a text would then only refine that text's vector, by adding the new word vectors to it.
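A minimal sketch of that incremental update in Python with NumPy. The `TextVector` class, the `embeddings` dict and the toy 3-dimensional vectors are hypothetical stand-ins for a real word2vec model. The sketch assumes the raw (unnormalized) sum is kept alongside the text, because adding new word vectors to an already normalized vector would weight them differently from the earlier words.

```python
import numpy as np


class TextVector:
    """Running sum of word vectors for one text; normalized on demand."""

    def __init__(self, dims):
        # Raw, unnormalized sum of all word vectors seen so far.
        self.total = np.zeros(dims, dtype=np.float64)

    def add_words(self, words, embeddings):
        # `embeddings` is a hypothetical dict mapping word -> np.ndarray;
        # words without an embedding are skipped.
        for word in words:
            vec = embeddings.get(word)
            if vec is not None:
                self.total += vec

    def normalized(self):
        # Unit-length copy of the sum; a zero vector is returned unchanged.
        norm = np.linalg.norm(self.total)
        return self.total / norm if norm > 0 else self.total.copy()


# Toy embeddings, invented for illustration only.
embeddings = {
    "pipe": np.array([1.0, 0.0, 0.0]),
    "valve": np.array([0.0, 1.0, 0.0]),
    "leak": np.array([0.0, 0.0, 1.0]),
}

text = TextVector(dims=3)
text.add_words(["pipe", "valve"], embeddings)
first = text.normalized()

# The text grows later: only the new words need to be added.
text.add_words(["leak", "unknown"], embeddings)
second = text.normalized()
```

Since both vectors are unit length, the cosine similarity between two texts reduces to a plain dot product of their normalized vectors.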
Is it possible to use Elasticsearch for cosine similarity by storing only the coordinates of each text's normalized vector in a document? If so, what is the proper index structure for such a search?
Word2Vec can capture the similarity between words by training on a large corpus. The similarity value is obtained by taking the word vectors and then calculating it with the cosine similarity equation.
Elasticsearch has very weak semantic search support, but you can work around it using faceted search and bag of words. You can index a thesaurus schema for plumbing terms, then do semantic matching over the text phrases in your sentences.
Listing 3: word2vec similarity with 100 dimensions and a larger dataset. We can see now that the results are much better and appropriate: we can use almost all of them as synonyms in the context of search. You can imagine using such a technique either at query or indexing time.
Scalable Semantic Vector Search with Elasticsearch
Elasticsearch is a popular open-source full-text search engine that can search many types of documents, and it recently added a dense_vector field type that stores dense vectors of float values.
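As a rough sketch of what an index structure built on dense_vector could look like, the Python below builds the mapping and a script_score query as plain dicts, ready to be sent as JSON. The index field name `text_vector`, the 100-dimension size and the `cosine_query` helper are assumptions for illustration. The exact script syntax has changed between 7.x releases (some versions expect `doc['text_vector']` rather than the field name as a string), so check it against your version.

```python
import json

DIMS = 100  # assumed embedding size; must match the word2vec model

# Index mapping with a dense_vector field. The field name is hypothetical.
mapping = {
    "mappings": {
        "properties": {
            "text_vector": {"type": "dense_vector", "dims": DIMS},
        }
    }
}


def cosine_query(query_vector, size=10):
    """Build a script_score request body ranking documents by cosine similarity."""
    return {
        "size": size,
        "query": {
            "script_score": {
                # match_all means every document is scored (a full scan).
                "query": {"match_all": {}},
                "script": {
                    # Adding 1.0 keeps the score non-negative.
                    "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }


body = cosine_query([0.0] * DIMS)
print(json.dumps(mapping, indent=2))
```

Replacing `match_all` with a cheaper filter query would restrict the scoring to a subset of documents, which matters because the script runs once per matched document.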
This Elasticsearch plugin implements a score function (dot product) for vectors stored using the delimited-payload-tokenfilter.
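For illustration, here is how a vector might be serialized for a delimited payload field. This assumes the convention of whitespace-separated `index|value` tokens, where the token is the dimension index and the payload is the coordinate. The plugin's actual expected format is not stated here, so treat the format and both helper names as assumptions.

```python
def to_payload_string(vector, delimiter="|"):
    """Encode a vector as 'index|value' tokens separated by spaces."""
    return " ".join(f"{i}{delimiter}{value}" for i, value in enumerate(vector))


def from_payload_string(text, delimiter="|"):
    """Decode the string produced by to_payload_string back into floats."""
    return [float(token.split(delimiter, 1)[1]) for token in text.split()]


encoded = to_payload_string([0.5, -0.25, 0.0])
decoded = from_payload_string(encoded)
```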
The complexity of this search is linear in the number of documents, and it is worse than tf-idf on a term query: ES first looks up the inverted index and then uses tf-idf to score only the matching documents, so tf-idf is not executed on every document in the index. With vectors, you are searching the vector space for the document with the lowest cosine distance, without the advantages of the inverted index.
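The linear cost is easy to see in a brute-force version of the same search, sketched here in Python with NumPy. The `top_k_cosine` helper and the toy data are hypothetical, and the document vectors are assumed to be already normalized to unit length.

```python
import numpy as np


def top_k_cosine(doc_matrix, query, k=3):
    """Score every row of doc_matrix against query and return the k best.

    doc_matrix: array of shape (n_docs, dims) with unit-length rows.
    The matrix product touches every document, so the cost grows
    linearly with n_docs; nothing is pruned as an inverted index would.
    """
    q = query / np.linalg.norm(query)
    scores = doc_matrix @ q  # cosine similarity for unit-length rows
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


# Toy unit-length document vectors, invented for illustration only.
docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
])
results = top_k_cosine(docs, np.array([3.0, 1.0]), k=2)
```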
For Elasticsearch 6.4.x, StaySense has made this plugin available.