
Extracting most important words from Elasticsearch index, using Node JS client


Inspired by the following Git repository and video, I'm trying to create a conceptual search for my domain, using word2vec as a synonyms filter for my queries.

Given the following document structure:

{
  "_index": "conversations",
  "_type": "conversation",
  "_id": "103130",
  "_score": 0.97602403,
  "_source": {
    "context": "Welcome to our service, how can I help? do you offer a free trial",
    "answer": "Yes we do. Here is a link for our trial account."
  }
}

I would like to iterate through the entire index and extract the words with the highest significance (tf-idf?).
Once I have the list of the top 100 words, I'll create a synonyms filter using word2vec.

My question is: how can this be done using the ES Node JS client?

Shlomi Schwartz asked Nov 14 '16




1 Answer

Tf-idf of documents is typically used to find the similarity between documents (using cosine similarity, Euclidean distance, etc.).

Tf, or term frequency, indicates how often a word occurs in a document. The higher the frequency of the word, the higher its importance.

Idf, or inverse document frequency, is based on the number of documents (in the input collection) that contain the word. The rarer the word, the higher its importance.

If we build document vectors from tf alone, we are prone to spam, because common words (e.g. pronouns, conjunctions) gain too much importance. The combination tf-idf therefore reflects the real significance of a word much better. In other words, to rank the words of a document by significance, it is not enough to calculate just the tf of each word; instead, compute tf-idf over the entire input collection and rank by the tf-idf value, which shows the real significance of the keywords.
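As a rough illustration (not part of the original answer), a minimal Node.js sketch of that ranking could look as follows, assuming docs is an array of plain-text strings (for example the "context" fields already pulled from the index) and using a naive lowercase tokenizer with no external libraries:

// Naive tokenizer: lowercase the text and keep alphabetic runs only.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

// Rank the terms of the whole collection by summed tf-idf.
function topTfIdfTerms(docs, topN = 100) {
  const docTokens = docs.map(tokenize);

  // Document frequency: in how many documents does each term appear?
  const df = new Map();
  for (const tokens of docTokens) {
    for (const term of new Set(tokens)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }

  // Accumulate tf-idf per term across all documents.
  const scores = new Map();
  for (const tokens of docTokens) {
    const tf = new Map();
    for (const term of tokens) tf.set(term, (tf.get(term) || 0) + 1);
    for (const [term, count] of tf) {
      const idf = Math.log(docs.length / df.get(term));
      scores.set(term, (scores.get(term) || 0) + (count / tokens.length) * idf);
    }
  }

  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([term]) => term);
}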

Have a look at this sample Python solution for calculating tf-idf values over a JSON list of tweets and finding similar tweets.

Github Sample
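To feed such a calculation, the raw text first has to come out of Elasticsearch. The following is only a rough sketch of how that could be done with the ES Node JS client, assuming the legacy elasticsearch npm package and the conversations index from the question; verify the scroll parameters against the client version you actually use:

// Rough sketch: collect every document's text from the "conversations"
// index with the scroll API (legacy 'elasticsearch' npm client assumed).
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

async function fetchAllTexts() {
  const texts = [];
  let response = await client.search({
    index: 'conversations',
    scroll: '30s',
    size: 500,
    body: {
      query: { match_all: {} },
      _source: ['context', 'answer']
    }
  });

  while (response.hits.hits.length > 0) {
    for (const hit of response.hits.hits) {
      texts.push(hit._source.context + ' ' + hit._source.answer);
    }
    // Keep the scroll context alive and fetch the next batch.
    response = await client.scroll({ scrollId: response._scroll_id, scroll: '30s' });
  }
  return texts;
}

// Usage: rank the collection with the tf-idf sketch above.
// fetchAllTexts().then(docs => console.log(topTfIdfTerms(docs)));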

GoT answered Nov 27 '22