I was following a tutorial which was available at Part 1 & Part 2. Unfortunately the author didn't have the time for the final section which involved using cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from stackoverflow, included is the code mentioned in the above link (just so as to make life easier)
from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer from nltk.corpus import stopwords import numpy as np import numpy.linalg as LA train_set = ["The sky is blue.", "The sun is bright."] # Documents test_set = ["The sun in the sky is bright."] # Query stopWords = stopwords.words('english') vectorizer = CountVectorizer(stop_words = stopWords) #print vectorizer transformer = TfidfTransformer() #print transformer trainVectorizerArray = vectorizer.fit_transform(train_set).toarray() testVectorizerArray = vectorizer.transform(test_set).toarray() print 'Fit Vectorizer to train set', trainVectorizerArray print 'Transform Vectorizer to test set', testVectorizerArray transformer.fit(trainVectorizerArray) print print transformer.transform(trainVectorizerArray).toarray() transformer.fit(testVectorizerArray) print tfidf = transformer.transform(testVectorizerArray) print tfidf.todense()
as a result of the above code I have the following matrix
Fit Vectorizer to train set [[1 0 1 0] [0 1 0 1]] Transform Vectorizer to test set [[0 1 1 1]] [[ 0.70710678 0. 0.70710678 0. ] [ 0. 0.70710678 0. 0.70710678]] [[ 0. 0.57735027 0.57735027 0.57735027]]
I am not sure how to use this output in order to calculate cosine similarity, I know how to implement cosine similarity with respect to two vectors of similar length but here I am not sure how to identify the two vectors.
Tf-idf is a transformation you apply to texts to get two real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors.
From Python: tf-idf-cosine: to find document similarity , it is possible to calculate document similarity using tf-idf cosine.
The common way to compute the Cosine similarity is to first we need to count the word occurrence in each document. To count the word occurrence in each document, we can use CountVectorizer or TfidfVectorizer functions that are provided by Scikit-Learn library.
WIth the Help of @excray's comment, I manage to figure it out the answer, What we need to do is actually write a simple for loop to iterate over the two arrays that represent the train data and test data.
First implement a simple lambda function to hold formula for the cosine calculation:
cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
And then just write a simple for loop to iterate over the to vector, logic is for every "For each vector in trainVectorizerArray, you have to find the cosine similarity with the vector in testVectorizerArray."
from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer from nltk.corpus import stopwords import numpy as np import numpy.linalg as LA train_set = ["The sky is blue.", "The sun is bright."] #Documents test_set = ["The sun in the sky is bright."] #Query stopWords = stopwords.words('english') vectorizer = CountVectorizer(stop_words = stopWords) #print vectorizer transformer = TfidfTransformer() #print transformer trainVectorizerArray = vectorizer.fit_transform(train_set).toarray() testVectorizerArray = vectorizer.transform(test_set).toarray() print 'Fit Vectorizer to train set', trainVectorizerArray print 'Transform Vectorizer to test set', testVectorizerArray cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3) for vector in trainVectorizerArray: print vector for testV in testVectorizerArray: print testV cosine = cx(vector, testV) print cosine transformer.fit(trainVectorizerArray) print print transformer.transform(trainVectorizerArray).toarray() transformer.fit(testVectorizerArray) print tfidf = transformer.transform(testVectorizerArray) print tfidf.todense()
Here is the output:
Fit Vectorizer to train set [[1 0 1 0] [0 1 0 1]] Transform Vectorizer to test set [[0 1 1 1]] [1 0 1 0] [0 1 1 1] 0.408 [0 1 0 1] [0 1 1 1] 0.816 [[ 0.70710678 0. 0.70710678 0. ] [ 0. 0.70710678 0. 0.70710678]] [[ 0. 0.57735027 0.57735027 0.57735027]]
First off, if you want to extract count features and apply TF-IDF normalization and row-wise euclidean normalization you can do it in one operation with TfidfVectorizer
:
>>> from sklearn.feature_extraction.text import TfidfVectorizer >>> from sklearn.datasets import fetch_20newsgroups >>> twenty = fetch_20newsgroups() >>> tfidf = TfidfVectorizer().fit_transform(twenty.data) >>> tfidf <11314x130088 sparse matrix of type '<type 'numpy.float64'>' with 1787553 stored elements in Compressed Sparse Row format>
Now to find the cosine distances of one document (e.g. the first in the dataset) and all of the others you just need to compute the dot products of the first vector with all of the others as the tfidf vectors are already row-normalized.
As explained by Chris Clark in comments and here Cosine Similarity does not take into account the magnitude of the vectors. Row-normalised have a magnitude of 1 and so the Linear Kernel is sufficient to calculate the similarity values.
The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector you need to slice the matrix row-wise to get a submatrix with a single row:
>>> tfidf[0:1] <1x130088 sparse matrix of type '<type 'numpy.float64'>' with 89 stored elements in Compressed Sparse Row format>
scikit-learn already provides pairwise metrics (a.k.a. kernels in machine learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product that is also known as the linear kernel:
>>> from sklearn.metrics.pairwise import linear_kernel >>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten() >>> cosine_similarities array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602, 0.04457106, 0.03293218])
Hence to find the top 5 related documents, we can use argsort
and some negative array slicing (most related documents have highest cosine similarity values, hence at the end of the sorted indices array):
>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1] >>> related_docs_indices array([ 0, 958, 10576, 3277]) >>> cosine_similarities[related_docs_indices] array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])
The first result is a sanity check: we find the query document as the most similar document with a cosine similarity score of 1 which has the following text:
>>> print twenty.data[0] From: [email protected] (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ----
The second most similar document is a reply that quotes the original message hence has many common words:
>>> print twenty.data[958] From: [email protected] (Robert Seymour) Subject: Re: WHAT car is this!? Article-I.D.: reed.1993Apr21.032905.29286 Reply-To: [email protected] Organization: Reed College, Portland, OR Lines: 26 In article <[email protected]> [email protected] (where's my thing) writes: > > I was wondering if anyone out there could enlighten me on this car I saw > the other day. It was a 2-door sports car, looked to be from the late 60s/ > early 70s. It was called a Bricklin. The doors were really small. In addition, > the front bumper was separate from the rest of the body. This is > all I know. If anyone can tellme a model name, engine specs, years > of production, where this car is made, history, or whatever info you > have on this funky looking car, please e-mail. Bricklins were manufactured in the 70s with engines from Ford. They are rather odd looking with the encased front bumper. There aren't a lot of them around, but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a performance Ford with new styling slapped on top. > ---- brought to you by your neighborhood Lerxst ---- Rush fan? -- Robert Seymour [email protected] Physics and Philosophy, Reed College (NeXTmail accepted) Artificial Life Project Reed College Reed Solar Energy Project (SolTrain) Portland, OR
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With