Okay, so I have been following these two posts on TF*IDF but am a little confused: http://css.dzone.com/articles/machine-learning-text-feature
Basically, I want to run a search query across multiple documents. I would like to use the scikit-learn toolkit as well as the NLTK library for Python.
The problem is that I don't see where the two TF*IDF vectors come from. I have one search query and multiple documents to search. My plan was to calculate the TF*IDF score of each document against the query, find the cosine similarity between them, and then rank the documents by sorting the scores in descending order. However, the code doesn't seem to produce the right vectors.
Whenever I reduce the query to a single sentence, it returns a huge list of 0's, which is really strange.
Here is the code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
train_set = ("The sky is blue.", "The sun is bright.") #Documents
test_set = ("The sun in the sky is bright.") #Query
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray
transformer.fit(trainVectorizerArray)
print transformer.transform(trainVectorizerArray).toarray()
transformer.fit(testVectorizerArray)
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
You're defining train_set and test_set as tuples, but I think that they should be lists. In fact, test_set isn't even a tuple: without a trailing comma, ("The sun in the sky is bright.") is just a plain string, so vectorizer.transform iterates over its individual characters instead of treating it as one document. Each single character is dropped by the default token pattern, which is why you get that huge array of 0's. Use lists instead:
train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query
With this change, the code seems to run fine.
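Since the end goal is to rank the documents against the query by cosine similarity, here is a minimal sketch of how I would wire that up. It uses TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in one step, and scikit-learn's built-in English stop word list instead of NLTK's (you can pass stop_words=stopwords.words('english') to keep the NLTK list). The variable names are just placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]            # Query

# Fit on the documents only, so the query is vectorized with the
# same vocabulary and IDF weights as the documents.
vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(train_set)
query_vector = vectorizer.transform(test_set)

# Cosine similarity of the query against every document, then rank
# document indices from most to least similar.
scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = scores.argsort()[::-1]
for idx in ranking:
    print train_set[idx], scores[idx]

Note that fitting on the documents (never on the query) is the important part: calling transformer.fit(testVectorizerArray) as in your original code recomputes the IDF weights from the query alone, which makes the two sets of vectors incomparable.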