My goal is to input 3 queries and find out which query is most similar to a set of 5 documents.
So far I have calculated the tf-idf of the documents as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
def get_term_frequency_inverse_data_frequency(documents):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(allDocs)
    return matrix

def get_tf_idf_query_similarity(documents, query):
    tfidf = get_term_frequency_inverse_data_frequency(documents)
The problem I am having is this: now that I have the tf-idf of the documents, what operations do I perform on the query so I can find its cosine similarity to the documents?
As shown in "Python: tf-idf-cosine: to find document similarity", it is possible to calculate document similarity using tf-idf and cosine similarity.
The common way to compute the cosine similarity is to first count the word occurrences in each document. To do that, we can use the CountVectorizer or TfidfVectorizer classes provided by the scikit-learn library.
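To make the difference between the two vectorizers concrete, here is a minimal sketch. The toy corpus and variable names are made up for illustration and are not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]

# CountVectorizer produces raw term counts per document.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# TfidfVectorizer reweights those counts by inverse document frequency
# and, by default, L2-normalises each row.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)

# With default settings both tokenize the same way, so the learned
# vocabularies (term -> column index) are identical.
the_column = count_vectorizer.vocabulary_["the"]
the_count_in_second_doc = counts[1, the_column]
```

Both matrices have one row per document and one column per distinct term; only the weights differ.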
Here is my suggestion: plug the text-cleaning function into TfidfVectorizer directly, using its preprocessor parameter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
vectorizer = TfidfVectorizer(preprocessor=nlp.clean_tf_idf_text)
docs_tfidf = vectorizer.fit_transform(allDocs)
def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: TfIdfVectorizer model
    docs_tfidf: tfidf vectors for all docs
    query: query doc
    return: cosine similarity between query and all docs
    """
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    return cosineSimilarities
You can do as Nihal has written in his response, or you can use the nearest neighbors algorithm from sklearn. You have to select the proper metric (cosine):
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5, metric='cosine')
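A possible way to finish that idea is sketched below: fit the neighbors model on the document tf-idf matrix, then query it with the transformed query. The documents and query are invented for illustration. Note that kneighbors returns cosine distances, so the similarity is 1 minus the distance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical documents and query, for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs bark at the mailman",
    "cats and dogs are pets",
    "stock markets fell sharply today",
    "the mat was red",
]
query = "cat on a mat"

vectorizer = TfidfVectorizer()
docs_tfidf = vectorizer.fit_transform(docs)

# n_neighbors must not exceed the number of fitted documents.
neigh = NearestNeighbors(n_neighbors=5, metric='cosine')
neigh.fit(docs_tfidf)

# The query must be transformed with the same fitted vectorizer.
distances, indices = neigh.kneighbors(vectorizer.transform([query]))

ranked_docs = indices.flatten()            # document indices, closest first
similarities = 1 - distances.flatten()     # cosine similarity = 1 - cosine distance
```

This pays off mostly when you only need the top few documents out of a large collection; for 5 documents, calling cosine_similarity directly is simpler.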
The other answers were very helpful, but not entirely what I was looking for, as they didn't help me transform my query so I could compare it with the documents.
To transform the query, I first fit a vectorizer on the documents:
queryTFIDF = TfidfVectorizer().fit(allDocs)
I then use it to transform the query into a tf-idf vector with the same columns as the document matrix:
queryTFIDF = queryTFIDF.transform([query])
And then I just calculate the cosine similarity between all the documents and my query using the sklearn.metrics.pairwise.cosine_similarity function:
cosineSimilarities = cosine_similarity(queryTFIDF, docTFIDF).flatten()
I realise that, using Nihal's solution, I could have input my query as one of the documents and then calculated the similarity between it and the other documents, but this is what worked best for me.
The full code ends up looking like:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
def get_tf_idf_query_similarity(documents, query):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    docTFIDF = TfidfVectorizer().fit_transform(allDocs)
    queryTFIDF = TfidfVectorizer().fit(allDocs)
    queryTFIDF = queryTFIDF.transform([query])
    cosineSimilarities = cosine_similarity(queryTFIDF, docTFIDF).flatten()
    return cosineSimilarities
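For the original goal (3 queries against 5 documents), one way to extend this is to transform all the queries at once and compare them with a single vectorizer. The sketch below is only an illustration: rank_queries is a hypothetical helper, "most similar to the set" is assumed to mean the highest mean similarity across the documents, and the nlp.clean_tf_idf_text cleaning step is left out because that module is not shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_queries(documents, queries):
    """Return the index of the best query and the mean score of every query.

    Hypothetical helper: the "best" query is assumed to be the one with the
    highest mean cosine similarity over all documents.
    """
    vectorizer = TfidfVectorizer()
    docs_tfidf = vectorizer.fit_transform(documents)   # fit once, on the documents
    queries_tfidf = vectorizer.transform(queries)      # reuse the same vocabulary
    # Rows are queries, columns are documents.
    similarities = cosine_similarity(queries_tfidf, docs_tfidf)
    mean_scores = similarities.mean(axis=1)
    return int(mean_scores.argmax()), mean_scores

# Made-up data for illustration.
documents = ["apple banana", "apple cherry", "banana cherry",
             "apple banana cherry", "apple"]
queries = ["apple", "zebra", "durian fruit"]
best_query, scores = rank_queries(documents, queries)
```

A query made up only of words the documents never use gets an all-zero vector, so its similarity to every document is 0.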
Cosine similarity is the cosine of the angle between the vectors that represent the documents.
K(X, Y) = <X, Y> / (||X||*||Y||)
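As a quick check of the formula, here is a small sketch that computes it by hand with NumPy and compares the result with sklearn. The two vectors are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Arbitrary example vectors.
x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 1.0, 1.0])

# <X, Y> / (||X|| * ||Y||)
manual = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# cosine_similarity expects 2-D inputs, so reshape each vector to one row.
library = cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))[0, 0]
```

Here the dot product is 4 and the norms are sqrt(5) and sqrt(6), so both values come out as 4 / sqrt(30).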
Your tf-idf matrix will be a sparse matrix with dimensions (no. of documents) x (no. of distinct words).
To print the whole matrix you can use todense():
print(tfidf.todense())
Each row is the vector representation of one document. Likewise, each column holds the tf-idf scores of one unique word in the corpus.
The pairwise similarity between a vector and any other vector can be calculated from your tf-idf matrix as:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(reference_vector, tfidf_matrix)
The output will be an array with one entry per document (shape 1 x no. of documents), giving the similarity score between your reference vector and the vector corresponding to each document. Of course, the similarity between the reference vector and itself will be 1. Because tf-idf weights are non-negative, every score will be a value between 0 and 1.
To find the similarity between the first and second documents:
print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1]))
array([[0.36651513]])
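Putting those pieces together, here is a small self-contained sketch with a made-up corpus. It takes the first document as the reference vector; a slice is used instead of tfidf_matrix[0] only so that the result is guaranteed to stay two-dimensional.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus for illustration.
docs = ["the cat sat on the mat", "the dog sat on the log", "stock markets fell"]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# First document as the reference; the slice keeps it as a 1 x n_words matrix.
reference_vector = tfidf_matrix[0:1]

# One row (the reference) by one column per document.
scores = cosine_similarity(reference_vector, tfidf_matrix)
```

The first score is the reference compared with itself, the second reflects the words the first two documents share, and the third is 0 because the last document has no words in common with the reference.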