TF-IDF: Find Cosine Similarity Between New Document and Dataset

Tags: python, machine-learning, scikit-learn, tf-idf

I have a TF-IDF matrix of a dataset of products:

tfidf = TfidfVectorizer().fit_transform(words)

where words is a list of descriptions. This produces a 69258x22024 matrix.

Now I want to find cosine similarities between a new product and the ones in the matrix, as I need to find the 10 most similar products to it. I vectorize it using the same method above.

However, I cannot multiply the matrices because their sizes differ (the new document has about 6 words, so it becomes a 1x6 matrix), so I need a TfidfVectorizer that produces the same number of columns as the original one.

How do I do it?


asked Jul 01 '17 15:07

Mohamed Oun

2 Answers

I have found a way to make it work. Instead of calling fit_transform on the new document, fit the vectorizer once on the corpus, so that it learns the vocabulary and the IDF weights, and keep the fitted vectorizer:

vectorizer = TfidfVectorizer().fit(words)
datasetTFIDF = vectorizer.transform(words)

Now the new document can be mapped into the same feature space with the transform function, which gives a 1x22024 row instead of a 1x6 one:

queryTFIDF = vectorizer.transform([query])

where query is the query string.
We can then compute the cosine similarities and pick the 10 most similar/relevant documents:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarities = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
related_product_indices = cosine_similarities.argsort()[:-11:-1]
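As a rough end-to-end sketch of the steps above: the product descriptions, the query string, and the value of top_k below are made up for illustration and are not from the original dataset. It also shows why a vectorizer fitted on the query alone produces a matrix with a different number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the list of product descriptions.
words = [
    "red cotton shirt",
    "blue cotton shirt",
    "leather office chair",
    "wooden dining table",
]
query = "green cotton shirt"  # hypothetical new product

# Fit once on the corpus: this fixes the vocabulary (columns) and IDF weights.
vectorizer = TfidfVectorizer()
datasetTFIDF = vectorizer.fit_transform(words)

# transform() reuses that vocabulary, so the column counts match.
# Terms the vectorizer has never seen ("green") are simply ignored.
queryTFIDF = vectorizer.transform([query])

# A vectorizer fitted on the query alone only knows the query's own terms.
query_only = TfidfVectorizer().fit_transform([query])
print(query_only.shape, queryTFIDF.shape, datasetTFIDF.shape)

top_k = 2  # the question asks for 10; this toy corpus is smaller
cosine_similarities = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
related_product_indices = cosine_similarities.argsort()[::-1][:top_k]
print(related_product_indices, cosine_similarities[related_product_indices])
```

The only object that has to be kept between indexing time and query time is the fitted vectorizer; the dataset matrix can be computed once and reused for every query.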


answered Oct 17 '22 22:10

Mohamed Oun

I think the variable name words is ambiguous. I advise you to rename words to corpus.

The idea is to put all your documents in the corpus variable first and then compute the cosine similarity.

Here is an example:

tf_idf.py:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?',
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
words = vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
similarity_matrix = cosine_similarity(tfidf)

Run it in your IPython console:

In [1]: run tf_idf.py

In [2]: words
Out[2]: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [3]: tfidf.toarray()
Out[3]: 
array([[ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674],
       [ 0.        ,  0.27230147,  0.        ,  0.27230147,  0.        ,
         0.85322574,  0.22262429,  0.        ,  0.27230147],
       [ 0.55280532,  0.        ,  0.        ,  0.        ,  0.55280532,
         0.        ,  0.28847675,  0.55280532,  0.        ],
       [ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674]])

In [4]: similarity_matrix
Out[4]: 
array([[ 1.        ,  0.43830038,  0.1034849 ,  1.        ],
       [ 0.43830038,  1.        ,  0.06422193,  0.43830038],
       [ 0.1034849 ,  0.06422193,  1.        ,  0.1034849 ],
       [ 1.        ,  0.43830038,  0.1034849 ,  1.        ]])

Notes:

tfidf is a scipy.sparse.csr_matrix; toarray() converts it to a numpy.ndarray (this is costly and is done here only to make the content easy to read).
similarity_matrix is a symmetric matrix.

You can do:

import numpy as np
print(np.triu(similarity_matrix, k=1))

This gives:

array([[ 0.        ,  0.43830038,  0.1034849 ,  1.        ],
       [ 0.        ,  0.        ,  0.06422193,  0.43830038],
       [ 0.        ,  0.        ,  0.        ,  0.1034849 ],
       [ 0.        ,  0.        ,  0.        ,  0.        ]])

This shows each pair of documents only once and drops the self-similarities on the diagonal.
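If a ranked list of document pairs is more convenient than a matrix, one possible sketch is the following. It reuses the corpus from the example; the name ranked_pairs is only illustrative, and np.triu_indices_from is used here as one way to read the entries above the diagonal.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer().fit_transform(corpus)
similarity_matrix = cosine_similarity(tfidf)

# The matrix is symmetric, so the upper triangle holds every pair once.
assert np.allclose(similarity_matrix, similarity_matrix.T)

# Row/column indices of the entries strictly above the diagonal.
rows, cols = np.triu_indices_from(similarity_matrix, k=1)
scores = similarity_matrix[rows, cols]

# Sort the pairs from most to least similar.
order = np.argsort(scores)[::-1]
ranked_pairs = [(int(rows[i]), int(cols[i]), float(scores[i])) for i in order]

for i, j, score in ranked_pairs:
    print(f"doc {i} ~ doc {j}: {score:.3f}")
```

The first and fourth documents contain the same words, so that pair comes out on top.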

See:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

answered Oct 17 '22 21:10

glegoux
