Python - tf-idf predict a new document similarity

Inspired by this answer, I'm trying to find the cosine similarity between the documents behind a trained tf-idf vectorizer and a new document, and return the most similar documents.

The code below computes the cosine similarities for the first document vector, not for a new query:

>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
    0.04457106,  0.03293218])

Since my training data is huge, looping through the entire trained vectorizer sounds like a bad idea. How can I infer the vector of a new document and find the related docs, in the same way as the code below?

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([    0,   958, 10576,  3277])
>>> cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])

asked Sep 25 '16 16:09

Shlomi Schwartz

2 Answers

This problem can be partially addressed by combining the vector space model (tf-idf with cosine similarity) with the boolean model. Both are concepts from information retrieval, and both are used (and nicely explained) in Elasticsearch, a pretty good search engine.

The idea is simple: you store your documents in an inverted index. This is comparable to the index at the end of a book, which lists each word together with the pages (documents) where it is mentioned.

Instead of calculating the similarity against the tf-idf vectors of all documents, you only calculate it for documents that have at least one word (or some threshold number of words) in common with the query. This can be done by looping over the words in the queried document, using the inverted index to find the documents that also contain each word, and calculating the similarity only for those.
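The candidate-selection step described above could look roughly like the sketch below. It uses only the standard library; the toy corpus, the whitespace `tokenize` helper, and the `min_shared` parameter are assumptions made for illustration and are not part of the question's code.

```python
from collections import Counter, defaultdict

# Toy corpus, made up for illustration.
docs = [
    "human computer interaction",
    "graph theory and trees",
    "computer systems and user interfaces",
    "cooking pasta at home",
]


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


# Inverted index: term -> set of ids of the documents that contain it.
inverted_index = defaultdict(set)
for doc_id, text in enumerate(docs):
    for term in set(tokenize(text)):
        inverted_index[term].add(doc_id)


def candidate_docs(query, min_shared=1):
    """Return ids of documents sharing at least `min_shared` distinct terms with the query."""
    shared = Counter()
    for term in set(tokenize(query)):
        for doc_id in inverted_index.get(term, ()):
            shared[doc_id] += 1
    return sorted(doc_id for doc_id, count in shared.items() if count >= min_shared)


candidates = candidate_docs("computer interaction design")
strict_candidates = candidate_docs("computer interaction design", min_shared=2)
print(candidates)         # documents sharing at least one term with the query
print(strict_candidates)  # documents sharing at least two terms with the query
```

The cosine similarity then only has to be computed for the returned rows, for example by indexing the tf-idf matrix with the candidate ids before calling linear_kernel. The tokenizer used for the index should match the one the vectorizer uses, otherwise candidates may be missed.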

answered Oct 17 '22 18:10

DJanssens

You should take a look at gensim. Example starting code looks like this:

from gensim import corpora, models, similarities

# Tokenize the corpus once (one document per line) and close the file afterwards
with open('corpus.txt') as f:
    texts = [line.lower().split() for line in f]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corpus)
# num_features must match the vocabulary size of the dictionary
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

At prediction time you first get the vector for the new doc:

doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_tfidf = tfidf[vec_bow]

Then get the similarities (sorted by most similar):

sims = index[vec_tfidf]  # similarity of the query against every document in the corpus
# (document_number, document_similarity) 2-tuples, most similar first
print(sorted(enumerate(sims), key=lambda item: -item[1]))

This still does a linear scan over the whole corpus, but gensim's implementation is more optimized than a hand-written loop. If that is not fast enough, you can look into approximate similarity search (Annoy, Falconn, NMSLIB).
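For comparison, the same two steps can be sketched with scikit-learn, which the question already uses. This assumes the fitted TfidfVectorizer object is still available alongside the `tfidf` matrix; the name `vectorizer` and the toy corpus are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy training corpus, made up for illustration.
train_docs = [
    "human computer interaction",
    "graph theory and trees",
    "computer systems and user interfaces",
    "cooking pasta at home",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(train_docs)  # rows are L2-normalized by default

# Reuse the fitted vocabulary and idf weights: transform only, do not refit.
new_doc = "Human computer interaction"
new_vec = vectorizer.transform([new_doc])

# With L2-normalized rows the dot product equals the cosine similarity.
cosine_similarities = linear_kernel(new_vec, tfidf).flatten()

# Indices of the two most similar training documents, most similar first.
related_docs_indices = cosine_similarities.argsort()[::-1][:2]
print(related_docs_indices)
print(cosine_similarities[related_docs_indices])
```

This is one sparse matrix product rather than a Python-level loop, so no explicit iteration over the training documents is needed.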

answered Oct 17 '22 19:10

elyase

Tags: python, machine-learning, scikit-learn, tf-idf, document-classification
