
TF*IDF for Search Queries

Okay, so I have been following these two posts on TF*IDF but am a little confused: http://css.dzone.com/articles/machine-learning-text-feature

Basically, I want to create a search query that searches through multiple documents. I would like to use the scikit-learn toolkit as well as the NLTK library for Python.

The problem is that I don't see where the two TF*IDF vectors come from. I have one search query and multiple documents to search. My plan was to calculate the TF*IDF scores of each document against the query, find the cosine similarity between them, and then rank the documents by sorting the scores in descending order. However, the code doesn't seem to come up with the right vectors.

Whenever I reduce the query to only one search, it returns a huge list of 0's, which is really strange.

Here is the code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords

train_set = ("The sky is blue.", "The sun is bright.") #Documents
test_set = ("The sun in the sky is bright.") #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)

tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
asked Aug 11 '12 by tabchas


1 Answer

You're defining train_set as a tuple and test_set as a plain string (parentheses without a trailing comma don't create a tuple), so vectorizer.transform() ends up iterating over the characters of your query instead of treating it as a single document. That's why you get that huge list of 0's. Both should be lists:

train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query

With these changes the code seems to run fine.
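
If you then want the ranking described in the question, here is a minimal sketch (not part of the original answer; the variable names are just illustrative). TfidfVectorizer is equivalent to a CountVectorizer followed by a TfidfTransformer, and cosine_similarity scores the query against every document in one call:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["The sky is blue.", "The sun is bright."]
query = ["The sun in the sky is bright."]

# TfidfVectorizer = CountVectorizer + TfidfTransformer in one step
vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(documents)  # fit on the documents
query_vector = vectorizer.transform(query)         # project the query into the same space

# One cosine score per document; argsort in reverse gives the ranking
scores = cosine_similarity(query_vector, doc_vectors).flatten()
for idx in scores.argsort()[::-1]:
    print("%.3f  %s" % (scores[idx], documents[idx]))

With the toy data above, "The sun is bright." should come out on top, since it shares more non-stopword terms with the query.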

answered Oct 15 '22 by Sicco