Okay, so I am a little confused, although this should be a simple, straightforward question.
After calculating the TF-IDF Matrix of the Document against the entire corpus, I get a result very similar to this:
array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])
How do I use this result to get the most similar document against the search query? Basically I am trying to re-create a search bar for Wikipedia. Based on a search query I want to return the most relevant articles from Wikipedia. In this scenario, there are 6 articles (rows) and the search query contains 3 words (columns).
Do I add up all the results in the columns or add up all the rows? Is the greater value the most relevant or is the lowest value the most relevant?
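Are you familiar with cosine similarity? For each article (vector A), compute its cosine similarity to the query (vector B), with the query vectorized over the same vocabulary as the articles. Then rank the articles by similarity in descending order and take the top result: a higher value means a more relevant article, so you don't sum the rows or the columns on their own. If you're willing to refactor, the gensim library is excellent.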
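To make the ranking step concrete, here is a minimal sketch in plain Python. The matrix and query values are made-up toy numbers (they are not the values from the question), and the helper names `cosine_similarity` and `rank_documents` are hypothetical. It assumes each article row and the query are TF-IDF vectors over the same 3 terms.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros, to avoid dividing by zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(doc_vectors, query_vector):
    """Return (doc_index, score) pairs, most similar first."""
    scores = [
        (i, cosine_similarity(doc, query_vector))
        for i, doc in enumerate(doc_vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy TF-IDF rows: one row per article, one column per term.
docs = [
    [0.8, 0.0, 0.6],
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
]
# Hypothetical query vector over the same three terms.
query = [0.0, 0.0, 1.0]

ranking = rank_documents(docs, query)
best_index, best_score = ranking[0]
print("most relevant article:", best_index, "score:", round(best_score, 3))
```

If you are on scikit-learn, `sklearn.metrics.pairwise.cosine_similarity` computes the same scores for a whole matrix in one call, as long as the query was transformed with the same fitted vectorizer as the articles.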