Word2vec with Elasticsearch for text similarity

I have a large collection of texts, where each text is rapidly growing. I need to implement a similarity search.

The idea is to embed each word with word2vec and represent each text as a normalized vector obtained by summing the embeddings of the words in it. Subsequent additions to a text would then only refine that text's vector by adding the new word vectors to it.
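To make the idea concrete, here is a minimal sketch of the incremental update in plain Python. The `TextVector` class and the toy `embeddings` dict are hypothetical names used for illustration only; in practice the embeddings would come from a trained word2vec model.

```python
import math


class TextVector:
    """Running sum of word vectors for one growing text (hypothetical helper)."""

    def __init__(self, dim):
        # Unnormalized sum of all word embeddings seen so far.
        self.total = [0.0] * dim

    def add_words(self, words, embeddings):
        # embeddings: assumed to be a dict mapping word -> list of floats.
        for word in words:
            vec = embeddings.get(word)
            if vec is None:
                continue  # skip out-of-vocabulary words
            for i, x in enumerate(vec):
                self.total[i] += x

    def normalized(self):
        # Unit-length copy of the running sum, suitable for cosine similarity.
        norm = math.sqrt(sum(x * x for x in self.total))
        if norm == 0.0:
            return list(self.total)
        return [x / norm for x in self.total]


# Toy 2-dimensional embeddings, for illustration only.
embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

tv = TextVector(dim=2)
tv.add_words(["cat", "dog"], embeddings)
unit = tv.normalized()

# Later, the text grows: only the new words need to be added.
tv.add_words(["cat"], embeddings)
unit_after = tv.normalized()
```

Note that the sketch keeps the unnormalized sum and normalizes only on read. Adding new word vectors directly to an already normalized vector would give the new words a different weight than the old ones.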

Is it possible to use Elasticsearch for cosine similarity by storing only the coordinates of each text's normalized vector in a document? If so, what is the proper index structure for such a search?

asked Feb 23 '17 06:02

Alec Matusis

2 Answers

This Elasticsearch plugin implements a score function (dot product) for vectors stored using the delimited-payload-tokenfilter. Since your vectors are normalized, the dot product is equal to the cosine similarity.
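As a rough sketch of what storing a vector through a delimited payload filter could look like, the snippet below turns a vector into a whitespace-separated string of `index|value` tokens. The `to_payload_string` helper, the use of the dimension index as the token, and the `|` delimiter are assumptions for illustration; check the plugin's own documentation for the exact field format it expects.

```python
def to_payload_string(vector, delimiter="|"):
    """Encode a vector as "<dimension index><delimiter><value>" tokens.

    Hypothetical helper: the token layout is an assumption, not the
    plugin's documented format.
    """
    return " ".join(f"{i}{delimiter}{v}" for i, v in enumerate(vector))


# A unit-length 2-dimensional vector, for illustration only.
doc_field = to_payload_string([0.6, 0.8])
print(doc_field)
```

The resulting string would be indexed in a text field whose analyzer splits on whitespace and applies the payload filter, so that each dimension's value is stored as the payload of its token.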

The complexity of this search is linear in the number of documents, which is worse than tf-idf on a term query. For a term query, ES first looks up the matching documents in the inverted index and then computes tf-idf scores only for those documents, so tf-idf is never executed on every document in the index. With vectors, you are looking for the document whose vector has the lowest cosine distance to the query vector, and every document has to be scored because the inverted index gives no advantage.
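The linear cost can be illustrated with a brute-force scan in plain Python. This is only a sketch of the scoring pattern, not of how the plugin is implemented; `dot`, `top_k` and the toy `docs` dict are hypothetical names.

```python
import heapq


def dot(a, b):
    # Dot product; equals cosine similarity when both vectors are unit length.
    return sum(x * y for x, y in zip(a, b))


def top_k(query, docs, k=2):
    # docs: assumed to be a dict mapping document id -> unit-length vector.
    # Every document is scored, so the cost grows linearly with len(docs).
    scored = ((dot(query, vec), doc_id) for doc_id, vec in docs.items())
    return heapq.nlargest(k, scored)


# Toy unit-length vectors, for illustration only.
docs = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
results = top_k([0.6, 0.8], docs, k=2)
best_ids = [doc_id for _, doc_id in results]
```

There is no step here that narrows the candidate set before scoring, which is exactly the part an inverted index provides for term queries.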

answered Sep 30 '22 00:09

angleto

For Elasticsearch 6.4.x, StaySense has made this plugin available.

answered Sep 30 '22 01:09

Alex Moore-Niemi
