Extracting most important words from Elasticsearch index, using Node JS client

Inspired by a Git repository and a video, I'm trying to build a conceptual search for my domain, using word2vec as a synonym filter for my queries.

Given the following document structure:

{
  "_index": "conversations",
  "_type": "conversation",
  "_id": "103130",
  "_score": 0.97602403,
  "_source": {
    "context": "Welcome to our service, how can I help? do you offer a free trial",
    "answer": "Yes we do. Here is a link for our trial account."
  }
}

I would like to iterate through the entire index and extract the most significant words (by tf-idf?).
Once I have a list of the top 100 words, I'll create a synonym filter using word2vec.

My question is: how can this be done using the ES Node JS client?

asked Nov 14 '16 14:11

Shlomi Schwartz

1 Answer

Tf-idf is typically used to build document vectors and then measure the similarity of documents (using cosine similarity, Euclidean distance, etc.).

Tf, or term frequency, is how often a word occurs in a document. The higher the frequency of the word, the more important it is to that document.

Idf, or inverse document frequency, is derived from the number of documents in the input collection that contain the word: the fewer documents contain it, the higher its idf. The rarer the word, the more important it is.

If we use only tf to build the document vector, common words (e.g. pronouns and conjunctions) gain undue importance, which makes the ranking prone to spam. Combining the two as tf-idf gives a better indication of a word's real significance. In other words, to rank the words of a document by significance, don't compute just the tf of each word; compute tf-idf over the entire input collection and rank by that value.
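To make the ranking concrete, here is a minimal JavaScript sketch that scores words in a small in-memory list of texts. It is only an illustration under assumptions: `tokenize` and `topTerms` are hypothetical helper names, the tokenizer handles plain ASCII words only, tf is the word count divided by the document length, idf is `log(N / df)`, and each word is ranked by its highest tf-idf across the collection (summing or averaging would be equally reasonable choices).

```javascript
// Sketch only: rank words of an in-memory corpus by their best tf-idf score.
// tokenize() and topTerms() are hypothetical helpers, not Elasticsearch APIs.

function tokenize(text) {
  // Lowercase and keep runs of ASCII letters, digits and apostrophes.
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

function topTerms(docs, limit) {
  const tokenized = docs.map(tokenize);

  // Document frequency: in how many documents each term occurs.
  const df = new Map();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }

  // For each term, keep its highest tf-idf over all documents.
  const best = new Map();
  for (const tokens of tokenized) {
    const counts = new Map();
    for (const term of tokens) {
      counts.set(term, (counts.get(term) || 0) + 1);
    }
    for (const [term, count] of counts) {
      const tf = count / tokens.length;
      const idf = Math.log(docs.length / df.get(term));
      best.set(term, Math.max(best.get(term) || 0, tf * idf));
    }
  }

  // Highest score first, truncated to the requested number of terms.
  return [...best.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([term, score]) => ({ term, score }));
}

// Example with the two fields of the document shown in the question.
const sample = [
  'Welcome to our service, how can I help? do you offer a free trial',
  'Yes we do. Here is a link for our trial account.',
];
console.log(topTerms(sample, 5));
```

Note that a word occurring in every document gets an idf of `log(1) = 0` and therefore drops to the bottom of the list, which is the behaviour described above for common words.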

Have a look at this sample Python solution, which calculates tf-idf values for a list of JSON tweets and finds similar tweets.

Github Sample
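For the Node JS side of the question, the texts first have to be read out of the index. One possible way, sketched below, is the scroll API. This is an assumption-laden example rather than a definitive implementation: `collectField` is a hypothetical helper, and the parameter names (`scroll`, `scrollId`, `body`) and the response shape (`_scroll_id`, `hits.hits`) follow the legacy `elasticsearch` npm package; the newer `@elastic/elasticsearch` client differs in some of these details. The page size of 500 and the 30-second scroll timeout are arbitrary.

```javascript
// Sketch only: read one text field from every document via the scroll API.
// Assumes a client whose search()/scroll() promises resolve to the response
// body itself, as in the legacy `elasticsearch` package, e.g. (hypothetical host):
//   const client = new require('elasticsearch').Client({ host: 'localhost:9200' });

async function collectField(client, index, field) {
  const texts = [];

  // The first request opens the scroll context and returns the first page.
  let response = await client.search({
    index,
    scroll: '30s',
    size: 500,
    body: { query: { match_all: {} } },
  });

  // Keep requesting pages until one comes back empty.
  while (response.hits.hits.length > 0) {
    for (const hit of response.hits.hits) {
      if (typeof hit._source[field] === 'string') {
        texts.push(hit._source[field]);
      }
    }
    response = await client.scroll({
      scrollId: response._scroll_id,
      scroll: '30s',
    });
  }
  return texts;
}
```

Calling `collectField(client, 'conversations', 'context')` would then yield an array of strings on which tf-idf can be computed over the whole collection.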

answered Nov 27 '22 15:11

GoT
