
TF-IDF Simple Use - NLTK/Scikit Learn

Okay, so I am a little confused about what should be a simple, straightforward question.

After calculating the TF-IDF Matrix of the Document against the entire corpus, I get a result very similar to this:

array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])

How do I use this result to find the document most similar to the search query? Basically, I am trying to re-create a search bar for Wikipedia: given a search query, I want to return the most relevant articles. In this scenario, there are 6 articles (rows) and the search query contains 3 words (columns).

Do I add up the values down each column or across each row? And is the greatest value the most relevant, or the lowest?

tabchas asked Aug 08 '12 17:08


1 Answer

Are you familiar with cosine similarity? For each article (vector A), compute its cosine similarity to the query (vector B). Then rank the articles in descending order of similarity and choose the top result; the highest score is the most relevant article. If you're willing to refactor, the gensim library is excellent.
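As a rough illustration of that ranking step, here is a minimal sketch using only NumPy. The function name `rank_by_cosine` and the toy TF-IDF values are made up for this example; it assumes each row of the matrix is one article and that the query has been vectorized over the same term columns.

```python
import numpy as np

def rank_by_cosine(doc_matrix, query_vec):
    """Rank documents by cosine similarity to a query (hypothetical helper).

    doc_matrix: shape (n_docs, n_terms), one TF-IDF row per document.
    query_vec:  shape (n_terms,), TF-IDF weights of the query.
    Returns (order, scores): document indices from most to least similar,
    and the similarity score of each document in original row order.
    """
    docs = np.asarray(doc_matrix, dtype=float)
    query = np.asarray(query_vec, dtype=float)

    # cos(A, B) = (A . B) / (|A| * |B|)
    dots = docs @ query
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)

    # All-zero vectors have no direction; give them a score of 0.
    scores = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

    # Negate so that argsort yields descending similarity.
    order = np.argsort(-scores, kind="stable")
    return order, scores

# Toy data, invented for illustration: 3 articles, 3 terms.
doc_tfidf = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.0, 0.8],
])
query_tfidf = np.array([0.6, 0.0, 0.8])

order, scores = rank_by_cosine(doc_tfidf, query_tfidf)
print(order)   # article indices, best match first
print(scores)  # one similarity score per article
```

So rather than summing rows or columns, each row gets a single score against the query, and larger is better. With scikit-learn, `cosine_similarity` from `sklearn.metrics.pairwise` should give the same scores without the hand-written arithmetic.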

verbsintransit answered Oct 11 '22 21:10