Word2vec with elasticsearch for texts similarity

I have a large collection of texts, where each text is rapidly growing. I need to implement a similarity search.

The idea is to embed each word with word2vec, and to represent each text as a normalized vector obtained by summing the embeddings of all the words in it. Subsequent additions to a text would then only refine that text's vector, by adding the new word vectors to it.
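A minimal sketch of that idea, in plain Python. The `EMBEDDINGS` table and the `TextVector` class are hypothetical names made up for illustration; real vectors would come from a trained word2vec model. The sketch keeps the raw (unnormalized) sum and normalizes only on demand, so that appending words later gives the same result as summing everything at once.

```python
import math

# Hypothetical toy embeddings; a trained word2vec model would supply these.
EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}


class TextVector:
    """Running sum of word vectors for one growing text."""

    def __init__(self, dim):
        self.total = [0.0] * dim  # unnormalized sum of all word vectors so far

    def add_words(self, words):
        """Add the embeddings of newly appended words; unknown words are skipped."""
        for word in words:
            vec = EMBEDDINGS.get(word)
            if vec is None:
                continue
            for i, x in enumerate(vec):
                self.total[i] += x

    def normalized(self):
        """Return the unit-length version of the sum (all zeros if nothing was added)."""
        norm = math.sqrt(sum(x * x for x in self.total))
        if norm == 0.0:
            return list(self.total)
        return [x / norm for x in self.total]
```

Only the output of `normalized()` would need to be stored for search. The raw sum has to be kept somewhere as well, because adding new word vectors directly to an already normalized vector would weight the old and new words differently.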

Is it possible to use elasticsearch for cosine similarity, by storing only the coordinates of each text's normalized vector in a document? If so, what's the proper index structure for such search?

Alec Matusis asked Feb 23 '17 06:02



2 Answers

This Elasticsearch plugin implements a score function (dot product) for vectors stored using the delimited-payload token filter.
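As a rough illustration of what storing a vector through a delimited-payload token filter can look like: each coordinate becomes a token (its index) carrying a float payload (its value). The `index|value` layout, the helper name `to_payload_text`, and the analyzer name `payload_analyzer` are assumptions for this sketch, not necessarily the format the plugin above expects; check the plugin's own documentation.

```python
# Assumed analysis settings: whitespace tokenizer plus the delimited payload filter.
# The filter type is "delimited_payload" in recent Elasticsearch versions
# (older versions call it "delimited_payload_filter").
ANALYSIS_SETTINGS = {
    "analysis": {
        "analyzer": {
            "payload_analyzer": {  # hypothetical analyzer name
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["vector_payloads"],
            }
        },
        "filter": {
            "vector_payloads": {  # hypothetical filter name
                "type": "delimited_payload",
                "delimiter": "|",
                "encoding": "float",
            }
        },
    }
}


def to_payload_text(vector):
    """Encode a vector as 'index|value' tokens separated by spaces."""
    return " ".join(f"{i}|{x}" for i, x in enumerate(vector))
```

For example, `to_payload_text([0.5, -0.25])` gives `"0|0.5 1|-0.25"`, which would be indexed as the text of a field using that analyzer.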

The complexity of this search is linear in the number of documents, which is worse than tf-idf on a term query: ES first looks up the inverted index and then computes tf-idf scores only for the matching documents, so tf-idf is never run on every document in the index. With vectors, you are searching for the document whose vector has the lowest cosine distance to the query, and you get none of the advantages of the inverted index.
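To make the linear cost concrete, here is a small sketch of what such scoring amounts to, outside of Elasticsearch. The function name `brute_force_search` and the `docs` layout are made up for illustration. For unit-length vectors the cosine similarity is just the dot product, and every stored vector has to be visited once per query.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_search(query, docs, k=3):
    """Score every document against the query and return the top k.

    `query` and the values of `docs` are assumed to be unit-length vectors,
    so the dot product equals the cosine similarity.
    """
    scored = [(doc_id, dot(query, vec)) for doc_id, vec in docs.items()]  # O(number of docs)
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A call such as `brute_force_search(q, docs, k=10)` does one dot product per document, whatever the query is; there is no equivalent of the inverted index to narrow down the candidates first.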

angleto answered Sep 30 '22 00:09


For Elasticsearch 6.4.x, StaySense has made this plugin available.

Alex Moore-Niemi answered Sep 30 '22 01:09