 

Using sklearn how do I calculate the tf-idf cosine similarity between documents and a query?

My goal is to input 3 queries and find out which query is most similar to a set of 5 documents.

So far I have calculated the tf-idf of the documents doing the following:

from sklearn.feature_extraction.text import TfidfVectorizer

def get_term_frequency_inverse_data_frequency(documents):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(allDocs)
    return matrix

def get_tf_idf_query_similarity(documents, query):
    tfidf = get_term_frequency_inverse_data_frequency(documents)

The problem I am having is this: now that I have the tf-idf of the documents, what operations do I perform on the query so I can find its cosine similarity to the documents?

Asked Apr 14 '19 by OultimoCoder



4 Answers

Here is my suggestion:

  • We don't have to fit the model twice; we can reuse the same vectorizer for both the documents and the query.
  • The text-cleaning function can be plugged into TfidfVectorizer directly via its preprocessor parameter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(preprocessor=nlp.clean_tf_idf_text)
docs_tfidf = vectorizer.fit_transform(allDocs)

def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: TfIdfVectorizer model
    docs_tfidf: tfidf vectors for all docs
    query: query doc

    return: cosine similarity between query and all docs
    """
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    return cosineSimilarities
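A hypothetical end-to-end sketch of this answer, using toy documents and no custom preprocessor (since `nlp.clean_tf_idf_text` is not shown):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    # Reuse the already-fitted vectorizer: transform (not fit_transform) the query
    query_tfidf = vectorizer.transform([query])
    return cosine_similarity(query_tfidf, docs_tfidf).flatten()

docs = ["the cat sat on the mat",
        "dogs chase cats",
        "quantum computing uses qubits"]

vectorizer = TfidfVectorizer()          # plain vectorizer; no preprocessor in this sketch
docs_tfidf = vectorizer.fit_transform(docs)

scores = get_tf_idf_query_similarity(vectorizer, docs_tfidf, "a cat on a mat")
best = int(scores.argmax())             # index of the most similar document
```

`scores` holds one cosine-similarity value per document, so calling the function once per query answers the original three-queries-against-five-documents question.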
Answered Oct 16 '22 by Venkatachalam


You can do as Nihal has written in his answer, or you can use the nearest-neighbors algorithm from sklearn; you just have to select the proper metric (cosine):

from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5, metric='cosine')
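For completeness, a minimal sketch of how that approach might look, assuming toy documents (the document texts and variable names here are illustrative, not from the answer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = ["the cat sat on the mat",
        "dogs chase cats",
        "quantum computing uses qubits",
        "cats and dogs play",
        "the mat is red"]

vectorizer = TfidfVectorizer()
docs_tfidf = vectorizer.fit_transform(docs)

# With n_neighbors=5 all five documents come back, ranked by cosine distance
neigh = NearestNeighbors(n_neighbors=5, metric='cosine')
neigh.fit(docs_tfidf)

query_tfidf = vectorizer.transform(["a cat on a mat"])
distances, indices = neigh.kneighbors(query_tfidf)

# cosine similarity = 1 - cosine distance
similarities = 1 - distances
```

Note that the query must be transformed with the same fitted vectorizer as the documents so the vectors share a vocabulary.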
Answered Oct 16 '22 by AdForte


The other answers were very helpful but not entirely what I was looking for as they didn't help me transform my query so I could compare it with the documents.

To transform the query I first fit it to the document matrix:

queryTFIDF = TfidfVectorizer().fit(allDocs)

I then transform it into the matrix shape:

queryTFIDF = queryTFIDF.transform([query])

And then I just calculate the cosine similarity between my query and all the documents using the sklearn.metrics.pairwise.cosine_similarity function:

cosineSimilarities = cosine_similarity(queryTFIDF, docTFIDF).flatten()

I realise that, using Nihal's solution, I could input my query as one of the documents and then calculate the similarity between it and the others, but this is what worked best for me.

The full code ends up looking like:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_tf_idf_query_similarity(documents, query):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    docTFIDF = TfidfVectorizer().fit_transform(allDocs)
    queryTFIDF = TfidfVectorizer().fit(allDocs)
    queryTFIDF = queryTFIDF.transform([query])

    cosineSimilarities = cosine_similarity(queryTFIDF, docTFIDF).flatten()
    return cosineSimilarities
Answered Oct 16 '22 by OultimoCoder


Cosine similarity is cosine of the angle between the vectors that represent documents.

K(X, Y) = <X, Y> / (||X||*||Y||)
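As a sanity check, that formula can be evaluated directly on two small toy vectors (a hypothetical example, not part of the answer):

```python
import numpy as np

X = np.array([1.0, 2.0, 0.0])
Y = np.array([2.0, 1.0, 1.0])

# K(X, Y) = <X, Y> / (||X|| * ||Y||)
cos = X.dot(Y) / (np.linalg.norm(X) * np.linalg.norm(Y))
# (1*2 + 2*1 + 0*1) / (sqrt(5) * sqrt(6)) = 4 / sqrt(30)
```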

Your tf-idf matrix will be a sparse matrix with dimensions = no. of documents * no. of distinct words.

To print the whole matrix as a dense array you can use todense():

print(tfidf.todense())

Each row is the vector representation of one document; likewise, each column corresponds to the tf-idf score of one unique word in the corpus.

The pairwise similarity between a reference vector and every document vector can be calculated from your tf-idf matrix as:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(reference_vector, tfidf_matrix) 

The output will be an array of length equal to the number of documents, giving the similarity score between your reference vector and the vector for each document. Of course, the similarity between the reference vector and itself will be 1. Since tf-idf vectors are non-negative, each score will be a value between 0 and 1.
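A small self-contained check of these claims, using a made-up toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "dogs chase cats",
        "the mat is red"]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Compare document 0 against every row of the matrix, itself included
scores = cosine_similarity(tfidf_matrix[0], tfidf_matrix).flatten()
```

Here scores[0] is 1.0 (document 0 against itself), and every entry lies in [0, 1] because tf-idf vectors have no negative components.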

To find the similarity between the first and second documents:

print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1]))

array([[0.36651513]])
Answered Oct 16 '22 by Nihal Sangeeth