I have a TF-IDF matrix of a dataset of products:
tfidf = TfidfVectorizer().fit_transform(words)
where words is a list of descriptions. This produces a 69258x22024 matrix.
Now I want to find cosine similarities between a new product and the ones in the matrix, as I need to find the 10 most similar products to it. I vectorize it using the same method above.
However, I cannot multiply the matrices because their sizes differ (the new product has only about 6 words, so it becomes a 1x6 matrix). I need its TfidfVectorizer output to have the same number of columns as the original matrix.
How do I do it?
Cosine similarity measures the similarity between two non-zero vectors of an inner product space as the cosine of the angle between them, so it indicates whether the two vectors point in roughly the same direction. It is often used to measure document similarity in text analysis. For two vectors A and B: similarity = (A · B) / (||A|| ||B||).
TF-IDF will give you a representation for a given term in a document. Cosine similarity will give you a score for two different documents that share the same representation. However, "one of the simplest ranking functions is computed by summing the tf–idf for each query term".
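As a small illustration of the formula, here is a minimal sketch that computes cosine similarity by hand with NumPy. The helper name `cosine_sim` and the toy vectors are made up for this example; they are not part of scikit-learn.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the Euclidean norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1, 2, 0], [2, 4, 0])  # parallel vectors
orthogonal = cosine_sim([1, 0, 0], [0, 3, 0])      # no shared components

print(same_direction, orthogonal)
```

Parallel vectors score (approximately) 1 regardless of their length, and vectors with no shared components score 0, which is why the measure works for comparing documents of different sizes.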
I have found a way to make it work. Instead of calling fit_transform on the new document, you first fit the vectorizer on the corpus and keep the fitted vectorizer:
vectorizer = TfidfVectorizer().fit(words)
datasetTFIDF = vectorizer.transform(words)
Now we can 'transform' the query into a vector with the same columns as that matrix by using the transform function:
queryTFIDF = vectorizer.transform([query])
where query is the query string.
We can then compute the cosine similarities and find the 10 most similar/relevant documents:
cosine_similarities = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
related_product_indices = cosine_similarities.argsort()[:-11:-1]
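For reference, here is a self-contained sketch of the same idea end to end. The product descriptions, the query, and the names `dataset_tfidf`, `query_tfidf` and `top_k` are invented for illustration; only the scikit-learn calls come from the approach described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product descriptions standing in for `words`.
words = [
    "red cotton shirt",
    "blue cotton shirt",
    "stainless steel kitchen knife",
    "wireless bluetooth headphones",
    "red wool sweater",
]
query = "red cotton shirt for men"

vectorizer = TfidfVectorizer()
dataset_tfidf = vectorizer.fit_transform(words)  # learns vocabulary and IDF weights
query_tfidf = vectorizer.transform([query])      # reuses them, so the columns match

# One similarity score per product, then the indices of the best matches.
scores = cosine_similarity(query_tfidf, dataset_tfidf).flatten()
top_k = 3
top_indices = scores.argsort()[::-1][:top_k]

for i in top_indices:
    print(f"{scores[i]:.3f}  {words[i]}")
```

Words in the query that the vectorizer never saw during fitting ("for", "men" here) are simply ignored by transform, which is what keeps the query vector the same width as the corpus matrix.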
I think the variable name words is ambiguous. I advise you to rename words to corpus. In fact, you first put all your documents in the corpus variable, and then you compute the cosine similarity.
Here is an example:
tf_idf.py:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
words = vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
similarity_matrix = cosine_similarity(tfidf)
Run it in your IPython console:
In [1]: run tf_idf.py
In [2]: words
Out[2]: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
In [3]: tfidf.toarray()
Out[3]:
array([[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674],
[ 0. , 0.27230147, 0. , 0.27230147, 0. ,
0.85322574, 0.22262429, 0. , 0.27230147],
[ 0.55280532, 0. , 0. , 0. , 0.55280532,
0. , 0.28847675, 0.55280532, 0. ],
[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674]])
In [4]: similarity_matrix
Out[4]:
array([[ 1. , 0.43830038, 0.1034849 , 1. ],
[ 0.43830038, 1. , 0.06422193, 0.43830038],
[ 0.1034849 , 0.06422193, 1. , 0.1034849 ],
[ 1. , 0.43830038, 0.1034849 , 1. ]])
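A side observation, sketched under the assumption that the vectorizer keeps its default norm='l2' setting: each TF-IDF row then already has unit length, so the cosine similarity should coincide with the plain dot product, which `linear_kernel` computes without re-normalizing.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer().fit_transform(corpus)  # rows are L2-normalized by default

cos = cosine_similarity(tfidf)  # normalizes the rows, then takes dot products
dot = linear_kernel(tfidf)      # dot products only
same = np.allclose(cos, dot)

print(same)
```

If you change the `norm` parameter of the vectorizer, this shortcut no longer applies and cosine_similarity is the safer choice.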
Note: tfidf is a scipy.sparse.csr.csr_matrix; toarray converts it to a numpy.ndarray (but it is costly, and is used here only to see the content easily). You can do:
import numpy as np
print(np.triu(similarity_matrix, k=1))
This gives:
array([[ 0. , 0.43830038, 0.1034849 , 1. ],
[ 0. , 0. , 0.06422193, 0.43830038],
[ 0. , 0. , 0. , 0.1034849 ],
[ 0. , 0. , 0. , 0. ]])
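This shows only the interesting similarities: each pair of documents appears once, and the self-similarities on the diagonal are zeroed out.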
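Building on the upper-triangle trick, here is a rough sketch of how one might pull out the single most similar pair of documents. The matrix is copied from the output above (rounded values), and the names `upper`, `best_pair` and `best_score` are made up for this example.

```python
import numpy as np

# Similarity matrix copied from the example output above.
similarity_matrix = np.array([
    [1.0,        0.43830038, 0.1034849,  1.0],
    [0.43830038, 1.0,        0.06422193, 0.43830038],
    [0.1034849,  0.06422193, 1.0,        0.1034849],
    [1.0,        0.43830038, 0.1034849,  1.0],
])

# Keep only entries strictly above the diagonal (each pair once, no self-pairs).
upper = np.triu(similarity_matrix, k=1)

# Convert the flat position of the largest entry back to (row, column).
i, j = np.unravel_index(np.argmax(upper), upper.shape)
best_pair = (int(i), int(j))
best_score = float(upper[i, j])

print(best_pair, best_score)
```

Here the best pair is the first and fourth documents, which contain exactly the same words in a different order.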
See:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction