Inspired by the following Git repository and video, I'm trying to create a conceptual search for my domain, using word2vec
as a synonyms filter for my queries.
Given the following document structure:
{
  "_index": "conversations",
  "_type": "conversation",
  "_id": "103130",
  "_score": 0.97602403,
  "_source": {
    "context": "Welcome to our service, how can I help? do you offer a free trial",
    "answer": "Yes we do. Here is a link for our trial account."
  }
}
I would like to iterate through the entire index and extract the most significant words (ranked by tf-idf?).
Once I have the list of the top 100 words, I'll create a synonyms filter using word2vec.
My question is: How can this be done using the ES Node.js client?
You can use cURL in a UNIX terminal or Windows command prompt, the Kibana Console UI, or any one of the various low-level clients available to make an API call to get all of the documents in an Elasticsearch index. All of these methods use a variation of the GET request to search the index.
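For the Node.js client specifically, a minimal sketch of reading every document could look like the following. It assumes `client` is a `Client` instance from the `@elastic/elasticsearch` package (7.7 or later, where the `helpers.scrollDocuments` helper is available), and that each document has the `context` and `answer` fields shown in the question. The function name `collectTexts` and the page size of 500 are illustrative choices, not part of the client.

```javascript
// Sketch: gather the text of every document in an index.
// Assumption: `client` is a Client from @elastic/elasticsearch, created e.g. with
//   const { Client } = require("@elastic/elasticsearch");
//   const client = new Client({ node: "http://localhost:9200" });
async function collectTexts(client, index) {
  const texts = [];
  // scrollDocuments runs a scrolled search and yields each hit's _source.
  // With no query supplied, the search matches all documents.
  for await (const doc of client.helpers.scrollDocuments({ index, size: 500 })) {
    // Join the two text fields from the question's document structure.
    texts.push(`${doc.context || ""} ${doc.answer || ""}`.trim());
  }
  return texts;
}
```

Calling `collectTexts(client, "conversations")` would then resolve to an array with one string per conversation, which can be fed into whatever term-weighting step comes next.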
Since you're using the _cat/indices API, you could simply return the results in JSON (instead of tabular form) and then pipe that into jq to get the length of the index array in the response. This returns a number equal to the number of indices you have.
Elasticsearch lets you search through vast amounts of data, whether you're implementing real-time search experiences or doing in-depth data analysis. In this tutorial, you'll learn how to integrate Elasticsearch into your Node.js app.
Tf-idf is typically used to measure the similarity of documents (using cosine similarity, Euclidean distance, etc.).
Tf, or term frequency, is the frequency of a word in a document. The higher the frequency of the word, the higher its importance within that document.
Idf, or inverse document frequency, is derived from the number of documents (in the input collection) that contain the word, and it decreases as that number grows. The rarer the word, the higher its importance.
If we use only tf to build the document vector, we are prone to spam, because common words (e.g. pronouns, conjunctions) gain more importance. Combining tf and idf therefore gives a better indication of the real significance of a word. In other words, to rank the words of a document by significance, do not rely on the tf of each word alone; instead compute tf-idf over the entire input collection and rank by the tf-idf value.
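To make the ranking concrete, here is a minimal sketch in plain JavaScript. It assumes the texts are already in memory as an array of strings, uses a deliberately naive tokenizer, and ranks words by their tf-idf summed over all documents. That aggregation is one possible heuristic for "top words of the collection", not the only one, and the names `tokenize` and `topTerms` are made up for this example.

```javascript
// Split text into lowercase word tokens (a deliberately simple tokenizer).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Rank words across the collection by their tf-idf summed over all documents.
function topTerms(texts, limit = 100) {
  const docs = texts.map(tokenize);

  // Document frequency: word -> number of documents containing it.
  const docFreq = new Map();
  for (const tokens of docs) {
    for (const word of new Set(tokens)) {
      docFreq.set(word, (docFreq.get(word) || 0) + 1);
    }
  }

  const scores = new Map(); // word -> summed tf-idf
  for (const tokens of docs) {
    if (tokens.length === 0) continue;
    const counts = new Map();
    for (const word of tokens) counts.set(word, (counts.get(word) || 0) + 1);
    for (const [word, count] of counts) {
      const tf = count / tokens.length;                      // term frequency
      const idf = Math.log(docs.length / docFreq.get(word)); // inverse document frequency
      scores.set(word, (scores.get(word) || 0) + tf * idf);
    }
  }

  // Highest score first, truncated to the requested number of words.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word, score]) => ({ word, score }));
}
```

A word that appears in every document gets an idf of log(1) = 0 and so drops to the bottom of the ranking, which is the anti-spam effect described above. `topTerms(texts, 100)` would give the top 100 candidates to pass on to word2vec.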
Have a look at this sample Python solution for calculating tf-idf values for a list of JSON tweets and finding similar tweets.
Github Sample
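For the similarity step, a minimal JavaScript sketch of cosine similarity between two tf-idf vectors might look like this. It assumes each document vector is a `Map` from word to weight (a sparse representation chosen for this example). It is not taken from the linked sample.

```javascript
// Cosine similarity between two sparse vectors stored as Map(word -> weight).
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [word, weight] of a) {
    normA += weight * weight;
    // Only words present in both vectors contribute to the dot product.
    if (b.has(word)) dot += weight * b.get(word);
  }
  for (const weight of b.values()) normB += weight * weight;
  // An all-zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Two documents whose weighted words point in the same direction score close to 1, and documents with no words in common score 0.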