Using sklearn, how do I calculate the tf-idf cosine similarity between documents and a query?

Tags: python, scikit-learn, cosine-similarity, tf-idf

My goal is to input 3 queries and find out which query is most similar to a set of 5 documents.

So far I have calculated the tf-idf of the documents as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

def get_term_frequency_inverse_data_frequency(documents):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(allDocs)
    return matrix

def get_tf_idf_query_similarity(documents, query):
    tfidf = get_term_frequency_inverse_data_frequency(documents)

My problem is this: now that I have the tf-idf of the documents, what operations do I perform on the query so that I can find its cosine similarity to the documents?

asked Apr 14 '19 16:04

4 Answers

Here is my suggestion:

We don't have to fit the model twice; we can reuse the same vectorizer.
The text cleaning function can be plugged into TfidfVectorizer directly using the preprocessor parameter.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# allDocs: list of document strings (cleaning is now handled by the preprocessor)
vectorizer = TfidfVectorizer(preprocessor=nlp.clean_tf_idf_text)
docs_tfidf = vectorizer.fit_transform(allDocs)

def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: fitted TfidfVectorizer
    docs_tfidf: tf-idf matrix of all docs (one row per document)
    query: query string

    return: 1-D array of cosine similarities between the query and each doc
    """
    # Use transform (not fit_transform) so the query shares the documents' vocabulary
    query_tfidf = vectorizer.transform([query])
    cosine_similarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    return cosine_similarities
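
As a rough, self-contained sketch of how this could answer the original question (3 queries against 5 documents): the sample documents and queries below are hypothetical, the asker's nlp.clean_tf_idf_text preprocessor is left out, and scoring each query by its mean similarity over all documents is only one assumed way to define "most similar to the set".

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample data; replace with your own documents and queries.
documents = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "the stock market fell sharply today",
    "investors worry about the market",
    "my cat chases the neighbour's dog",
]
queries = ["cat and dog pets", "stock market investors", "quantum physics lecture"]

# Fit once on the documents, then reuse the fitted vectorizer for the queries.
vectorizer = TfidfVectorizer()
docs_tfidf = vectorizer.fit_transform(documents)
queries_tfidf = vectorizer.transform(queries)

# similarities[i, j] = cosine similarity between query i and document j.
similarities = cosine_similarity(queries_tfidf, docs_tfidf)
print(similarities.shape)  # (3, 5)

# Assumption: rank queries by their mean similarity over all documents.
mean_scores = similarities.mean(axis=1)
best_query = queries[int(np.argmax(mean_scores))]
print(best_query)
```

A query whose words never occur in the documents (the third one here) is transformed to an all-zero vector, so its similarity to every document is 0.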

answered Oct 16 '22 17:10

OultimoCoder

Cosine similarity is the cosine of the angle between the vectors that represent the documents.

K(X, Y) = <X, Y> / (||X||*||Y||)
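
The formula can be checked by hand against scikit-learn. This is a small sketch on two made-up vectors; the helper name cosine is hypothetical and not part of any library.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(x, y):
    """Cosine similarity <x, y> / (||x|| * ||y||) for two 1-D vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Made-up example vectors.
x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 1.0, 1.0])

manual = cosine(x, y)
# cosine_similarity expects 2-D inputs, hence the reshape to a single row each.
library = cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))[0, 0]
print(manual, library)  # both equal 4 / sqrt(30)
```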

Your tf-idf matrix will be a sparse matrix of shape (number of documents, number of distinct words).

To print the whole matrix, you can use todense():

print(tfidf.todense())

Each row is the vector representation of one document. Likewise, each column holds the tf-idf scores of one unique word in the corpus.
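
To see this row/column layout on a concrete case, here is a short sketch with three hypothetical documents. It uses the fitted vectorizer's vocabulary_ attribute, which maps each term to its column index.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with four distinct words.
docs = ["apple banana", "banana cherry", "apple cherry date"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# One row per document, one column per distinct term.
print(tfidf_matrix.shape)  # (3, 4)

# List the terms in column order.
for term, column in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(column, term)
```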

The pairwise similarity between a reference vector and every row of your tf-idf matrix can be calculated as:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(reference_vector, tfidf_matrix)

The output will be an array of shape (1, number of documents) holding the similarity score between your reference vector and the vector of each document. If the reference vector is itself a row of the matrix, its similarity with itself will be 1. Because tf-idf weights are non-negative, every score lies between 0 and 1.

To find the similarity between the first and second documents:
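
These properties can be illustrated with a short sketch on three hypothetical documents, using the first document as the reference vector. It also relies on TfidfVectorizer's default norm="l2" setting: the rows are already unit length, so a plain dot product (linear_kernel) should give the same scores as cosine_similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell today",
]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)  # default norm="l2"

# Use the first document as the reference vector.
reference_vector = tfidf_matrix[0]
scores = cosine_similarity(reference_vector, tfidf_matrix)  # shape (1, 3)

# With L2-normalised rows, the dot product equals the cosine similarity.
dot_scores = linear_kernel(reference_vector, tfidf_matrix)

# Document indices ordered from most to least similar to the reference.
ranking = np.argsort(scores.ravel())[::-1]
print(scores)
print(ranking)
```

The reference document ranks first with a score of 1 (up to floating-point rounding), and the third document, which shares no words with it, scores 0.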

print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1]))

array([[0.36651513]])

answered Oct 16 '22 16:10

Nihal Sangeeth
