
TF-IDF Find Cosine Similarity Between New Document and Dataset

I have a TF-IDF matrix of a dataset of products:

tfidf = TfidfVectorizer().fit_transform(words)

where words is a list of descriptions. This produces a 69258x22024 matrix.

Now I want to find cosine similarities between a new product and the ones in the matrix, as I need to find the 10 most similar products to it. I vectorize it using the same method above.

However, I cannot multiply the matrices because their shapes differ: the new product has about 6 words, so vectorizing it on its own gives a 1x6 matrix. I need a TfidfVectorizer that produces the same number of columns as the original one.

How do I do it?

Mohamed Oun asked Jul 01 '17 15:07



2 Answers

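I have found a way for it to work. Instead of calling fit_transform on the new document, first fit the vectorizer on the corpus, so that it learns the corpus vocabulary and IDF weights, and build the dataset matrix with it:

vectorizer = TfidfVectorizer().fit(words)
datasetTFIDF = vectorizer.transform(words)

Now we can 'transform' the query into a vector with the same columns as that matrix by using the transform function of the same fitted vectorizer:

queryTFIDF = vectorizer.transform([query])

where query is the query string.
We can then compute the cosine similarities and find the 10 most similar/relevant documents:

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarities = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
# indices of the 10 highest scores, best match first
related_product_indices = cosine_similarities.argsort()[:-11:-1]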
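
Putting the steps together, a minimal end-to-end sketch might look like the following. The tiny `descriptions` list, the query string and the helper name `top_k_similar` are made up for illustration. The key assumption is that one fitted vectorizer transforms both the dataset and the query, so both end up with the same number of columns.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the real list of product descriptions.
descriptions = [
    "red cotton t-shirt",
    "blue cotton t-shirt",
    "leather office chair",
    "wooden dining table",
    "red leather wallet",
]

def top_k_similar(query, vectorizer, dataset_tfidf, k=10):
    """Return (indices, scores) of the k dataset rows most similar to query."""
    # transform (not fit_transform) reuses the corpus vocabulary and IDF weights,
    # so the result is a 1 x n_features row compatible with dataset_tfidf.
    query_tfidf = vectorizer.transform([query])
    scores = cosine_similarity(query_tfidf, dataset_tfidf).flatten()
    k = min(k, scores.shape[0])
    top = np.argsort(scores)[::-1][:k]  # highest score first
    return top, scores[top]

vectorizer = TfidfVectorizer()
dataset_tfidf = vectorizer.fit_transform(descriptions)  # fit once on the corpus

indices, scores = top_k_similar("red cotton shirt", vectorizer, dataset_tfidf, k=3)
for i, s in zip(indices, scores):
    print(f"{s:.3f}  {descriptions[i]}")
```

Words in the query that the vectorizer never saw during fitting are simply ignored by transform, which is why a short query still maps onto the full set of corpus columns.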
Mohamed Oun answered Oct 17 '22 22:10


I think the variable name words is ambiguous. I advise you to rename words to corpus.

In fact, you first put all your documents in the corpus variable, and then you compute the cosine similarity.

Here is an example:

tf_idf.py:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?',
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
words = vectorizer.get_feature_names_out().tolist()  # use get_feature_names() on scikit-learn < 1.0
similarity_matrix = cosine_similarity(tfidf)

Run it in your IPython console:

In [1]: run tf_idf.py

In [2]: words
Out[2]: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [3]: tfidf.toarray()
Out[3]: 
array([[ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674],
       [ 0.        ,  0.27230147,  0.        ,  0.27230147,  0.        ,
         0.85322574,  0.22262429,  0.        ,  0.27230147],
       [ 0.55280532,  0.        ,  0.        ,  0.        ,  0.55280532,
         0.        ,  0.28847675,  0.55280532,  0.        ],
       [ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674]])

In [4]: similarity_matrix
Out[4]: 
array([[ 1.        ,  0.43830038,  0.1034849 ,  1.        ],
       [ 0.43830038,  1.        ,  0.06422193,  0.43830038],
       [ 0.1034849 ,  0.06422193,  1.        ,  0.1034849 ],
       [ 1.        ,  0.43830038,  0.1034849 ,  1.        ]])
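
As a sanity check on these numbers: TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the cosine similarity of two rows reduces to their dot product. The sketch below recomputes the entry for documents 0 and 1 by hand, using the same corpus as above; the names `manual`, `dot_only` and `library` are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer().fit_transform(corpus)
dense = tfidf.toarray()  # fine for 4 documents, costly for a large corpus

a, b = dense[0], dense[1]

# Textbook definition: cos(a, b) = a.b / (||a|| * ||b||)
manual = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rows are already unit length, so the plain dot product gives the same value.
dot_only = a.dot(b)

# Value computed by scikit-learn for the same pair of documents.
library = cosine_similarity(tfidf)[0, 1]

print(manual, dot_only, library)
```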

Note:

  • tfidf is a scipy.sparse.csr_matrix; toarray() converts it to a numpy.ndarray (this is costly for large matrices and is used here only to display the content easily).
  • similarity_matrix is a symmetric matrix.

You can keep only the strict upper triangle with np.triu:

import numpy as np
print(np.triu(similarity_matrix, k=1))

This gives:

array([[ 0.        ,  0.43830038,  0.1034849 ,  1.        ],
       [ 0.        ,  0.        ,  0.06422193,  0.43830038],
       [ 0.        ,  0.        ,  0.        ,  0.1034849 ],
       [ 0.        ,  0.        ,  0.        ,  0.        ]]) 

This shows only the interesting similarities: each pair of documents appears once, and the diagonal is dropped.
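
If the goal is a ranked list of document pairs rather than a matrix to read by eye, one option is np.triu_indices_from, which returns the row and column indices of the upper triangle. The sketch below reuses the similarity values printed above; the variable names are illustrative.

```python
import numpy as np

# Similarity matrix copied from the output above.
similarity_matrix = np.array([
    [1.0,        0.43830038, 0.1034849,  1.0],
    [0.43830038, 1.0,        0.06422193, 0.43830038],
    [0.1034849,  0.06422193, 1.0,        0.1034849],
    [1.0,        0.43830038, 0.1034849,  1.0],
])

# Indices (i, j) with i < j: every pair once, diagonal excluded (k=1).
rows, cols = np.triu_indices_from(similarity_matrix, k=1)
values = similarity_matrix[rows, cols]

# Sort the pairs from most to least similar.
order = np.argsort(values)[::-1]
pairs = [(int(rows[n]), int(cols[n]), float(values[n])) for n in order]

for i, j, score in pairs:
    print(f"doc {i} vs doc {j}: {score:.4f}")
```

For a corpus of tens of thousands of documents the full pairwise matrix may not fit in memory, so this approach is best suited to small or pre-filtered sets.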

See:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

glegoux answered Oct 17 '22 21:10