 

Python: tf-idf-cosine: to find document similarity

I was following a tutorial that was available at Part 1 & Part 2. Unfortunately the author didn't have time for the final section, which involved using cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from Stack Overflow; the code mentioned in that link is included below (just to make life easier).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
print
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
print
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

As a result of the above code I have the following matrices:

Fit Vectorizer to train set [[1 0 1 0]
 [0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]

I am not sure how to use this output to calculate cosine similarity. I know how to implement cosine similarity for two vectors of equal length, but here I am not sure how to identify the two vectors.

asked Aug 25 '12 by add-semi-colons



2 Answers

With the help of @excray's comment, I managed to figure out the answer. What we need to do is actually write a simple for loop to iterate over the two arrays that represent the train data and the test data.

First, implement a simple lambda function to hold the formula for the cosine calculation:

cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3) 
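
For instance, applied to two toy count vectors (a quick sketch to check the formula; the vectors here are made up, not from the vectorizer):

import numpy as np
import numpy.linalg as LA

cosine_function = lambda a, b: round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)

# Two made-up count vectors for illustration
a = np.array([1, 0, 1, 0])
b = np.array([0, 1, 1, 1])
print(cosine_function(a, b))  # 1 / (sqrt(2) * sqrt(3)) ~= 0.408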

Then just write a simple for loop to iterate over the two arrays; the logic is: for each vector in trainVectorizerArray, find the cosine similarity with the vector in testVectorizerArray.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray

cx = lambda a, b: round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

for vector in trainVectorizerArray:
    print vector
    for testV in testVectorizerArray:
        print testV
        cosine = cx(vector, testV)
        print cosine

transformer.fit(trainVectorizerArray)
print
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
print
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

Here is the output:

Fit Vectorizer to train set [[1 0 1 0]
 [0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]
[1 0 1 0]
[0 1 1 1]
0.408
[0 1 0 1]
[0 1 1 1]
0.816

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]
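
As a quick cross-check (my addition, not part of the original answer), scikit-learn's cosine_similarity reproduces the same numbers from the raw count vectors in a single call:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# The count vectors printed above, re-entered here for a standalone check
train = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])
test = np.array([[0, 1, 1, 1]])
print(cosine_similarity(train, test))
# [[ 0.408...]
#  [ 0.816...]]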
answered Oct 08 '22 by add-semi-colons


First off, if you want to extract count features and apply TF-IDF normalization and row-wise euclidean normalization, you can do it in one operation with TfidfVectorizer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()

>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
>>> tfidf
<11314x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 1787553 stored elements in Compressed Sparse Row format>

Now, to find the cosine distances between one document (e.g. the first in the dataset) and all of the others, you just need to compute the dot products of the first vector with all of the others, as the tfidf vectors are already row-normalized.

As explained by Chris Clark in the comments and here, cosine similarity does not take the magnitude of the vectors into account. Row-normalised vectors have a magnitude of 1, so the linear kernel is sufficient to calculate the similarity values.
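
To make that concrete, here is a small sketch (my addition, assuming the tfidf matrix built above) confirming that the linear kernel and cosine similarity agree on the L2-normalised rows:

import numpy as np
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

# tfidf rows are already L2-normalised (norm='l2' is the TfidfVectorizer default),
# so a plain dot product equals the cosine similarity.
lk = linear_kernel(tfidf[0:1], tfidf).flatten()
cs = cosine_similarity(tfidf[0:1], tfidf).flatten()
print(np.allclose(lk, cs))  # True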

The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector you need to slice the matrix row-wise to get a submatrix with a single row:

>>> tfidf[0:1]
<1x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 89 stored elements in Compressed Sparse Row format>

scikit-learn already provides pairwise metrics (a.k.a. kernels in machine learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product that is also known as the linear kernel:

>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
        0.04457106,  0.03293218])

Hence, to find the most related documents, we can use argsort and some negative array slicing (the most related documents have the highest cosine similarity values and therefore sit at the end of the sorted indices array):

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([    0,   958, 10576,  3277])
>>> cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])
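
A note on the slicing (my addition): [:-5:-1] walks backwards from the end of the sorted array and stops before the fifth-from-last element, so it returns four entries, not five. A toy sketch of the behaviour:

import numpy as np

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
order = scores.argsort()   # ascending: [0, 2, 4, 3, 1]
print(order[:-5:-1])       # last 4, reversed: [1, 3, 4, 2]
print(order[::-1][:5])     # a clearer way to take the true top 5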

The first result in the arrays above is a sanity check: the query document comes back as the most similar document, with a cosine similarity score of 1. It has the following text:

>>> print twenty.data[0]
From: [email protected] (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----

The second most similar document is a reply that quotes the original message, and hence has many words in common:

>>> print twenty.data[958]
From: [email protected] (Robert Seymour)
Subject: Re: WHAT car is this!?
Article-I.D.: reed.1993Apr21.032905.29286
Reply-To: [email protected]
Organization: Reed College, Portland, OR
Lines: 26

In article <[email protected]> [email protected] (where's my thing) writes:
>
>  I was wondering if anyone out there could enlighten me on this car I saw
> the other day. It was a 2-door sports car, looked to be from the late 60s/
> early 70s. It was called a Bricklin. The doors were really small. In addition,
> the front bumper was separate from the rest of the body. This is
> all I know. If anyone can tellme a model name, engine specs, years
> of production, where this car is made, history, or whatever info you
> have on this funky looking car, please e-mail.

Bricklins were manufactured in the 70s with engines from Ford. They
are rather odd looking with the encased front bumper. There aren't a
lot of them around, but Hemmings (Motor News) ususally has ten or so
listed. Basically, they are a performance Ford with new styling slapped on top.

>    ---- brought to you by your neighborhood Lerxst ----

Rush fan?

--
Robert Seymour              [email protected]
Physics and Philosophy, Reed College    (NeXTmail accepted)
Artificial Life Project         Reed College
Reed Solar Energy Project (SolTrain)    Portland, OR
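
Putting the pieces together, one might wrap the lookup in a small helper; this is a sketch under the assumptions above (most_similar is a hypothetical name, and the query document itself is skipped since it always scores 1 against itself):

from sklearn.metrics.pairwise import linear_kernel

def most_similar(doc_index, tfidf_matrix, top_n=5):
    """Return (index, score) pairs for the documents most similar to doc_index."""
    scores = linear_kernel(tfidf_matrix[doc_index:doc_index + 1], tfidf_matrix).flatten()
    best = scores.argsort()[::-1]
    # Drop the query document itself, keep the next top_n hits
    best = [i for i in best if i != doc_index][:top_n]
    return [(i, scores[i]) for i in best]

print(most_similar(0, tfidf))  # e.g. [(958, 0.549...), (10576, 0.329...), ...]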
answered Oct 08 '22 by ogrisel