TF-IDF Simple Use - NLTK/Scikit Learn

Question

Okay, so I am a little confused, though this should be a simple, straightforward question.

After calculating the TF-IDF Matrix of the Document against the entire corpus, I get a result very similar to this:

array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])

How do I use this result to find the document most similar to the search query? Basically, I am trying to re-create a search bar for Wikipedia: based on a search query, I want to return the most relevant articles. In this scenario, there are 6 articles (rows) and the search query contains 3 words (columns).

Do I add up all the results in the columns or add up all the rows? Is the greater value the most relevant or is the lowest value the most relevant?

verbsintransit · Accepted Answer

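Are you familiar with cosine similarity? For each article (vector A), compute its cosine similarity to the query (vector B). Then rank the articles by that score in descending order and choose the top result; a higher score means a more relevant article, so there is no need to sum rows or columns. If you're willing to refactor, the gensim library is excellent.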
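To make the ranking step concrete, here is a minimal sketch using scikit-learn's `TfidfVectorizer` and `cosine_similarity`. It assumes the query is transformed with the same vectorizer that was fitted on the articles, so both live in the same vocabulary space. The function name `rank_documents` and the sample texts are hypothetical, made up purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_documents(documents, query):
    """Return (document_index, score) pairs, most similar first.

    Hypothetical helper for illustration, not part of any library.
    """
    vectorizer = TfidfVectorizer()
    # Rows are documents, columns are vocabulary terms.
    doc_matrix = vectorizer.fit_transform(documents)
    # Reuse the fitted vocabulary and IDF weights for the query.
    query_vector = vectorizer.transform([query])
    # One similarity score per document; TF-IDF weights are non-negative,
    # so scores run from 0 (no shared terms) up to 1 (same direction).
    scores = cosine_similarity(query_vector, doc_matrix).ravel()
    # Sort indices by score, highest (most relevant) first.
    order = scores.argsort()[::-1]
    return [(int(i), float(scores[i])) for i in order]


# Made-up sample corpus standing in for the Wikipedia articles.
articles = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
for index, score in rank_documents(articles, "cat mat"):
    print(index, round(score, 3))
```

The first pair returned is the best match. Articles that share no terms with the query score 0, and their relative order is arbitrary.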

Tags:

python

nlp

nltk

scikit-learn

tf-idf

tabchas
