 

Python tf-idf: fast way to update the tf-idf matrix

I have a dataset of several thousand rows of text. My goal is to calculate the tf-idf scores and then the cosine similarity between documents. This is what I did with gensim in Python, following the tutorial:

dictionary = corpora.Dictionary(dat)
corpus = [dictionary.doc2bow(text) for text in dat]

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
index = similarities.MatrixSimilarity(corpus_tfidf)

Let's say we have the tf-idf matrix and the similarity index built. When a new document comes in, I want to query for its most similar document in the existing dataset.

Question: is there any way to update the tf-idf matrix so that I don't have to append the new document to the original dataset and recalculate the whole thing from scratch?

asked Feb 13 '17 by snowneji


2 Answers

I'll post my solution since there are no other answers. Let's say we are in the following scenario:

import gensim
from gensim import models
from gensim import corpora
from gensim import similarities
from nltk.tokenize import word_tokenize
import pandas as pd

# prepare a small toy corpus:
text = "I work on natural language processing and I want to figure out how does gensim work"
text2 = "I love computer science and I code in Python"
dat = pd.Series([text,text2])
dat = dat.apply(lambda x: str(x).lower()) 
dat = dat.apply(lambda x: word_tokenize(x))


dictionary = corpora.Dictionary(dat)
corpus = [dictionary.doc2bow(doc) for doc in dat]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]


#Query:
query_text = "I love icecream and gensim"
query_text = query_text.lower()
query_text = word_tokenize(query_text)
vec_bow = dictionary.doc2bow(query_text)
vec_tfidf = tfidf[vec_bow]

if we look at:

print(vec_bow)
[(0, 1), (7, 1), (12, 1), (15, 1)]

and:

print(tfidf[vec_bow])
[(12, 0.7071067811865475), (15, 0.7071067811865475)]

For reference, the id-to-token mapping:

print(dictionary.items())

[(0, u'and'),
 (1, u'on'),
 (8, u'processing'),
 (3, u'natural'),
 (4, u'figure'),
 (5, u'language'),
 (9, u'how'),
 (7, u'i'),
 (14, u'code'),
 (19, u'in'),
 (2, u'work'),
 (16, u'python'),
 (6, u'to'),
 (10, u'does'),
 (11, u'want'),
 (17, u'science'),
 (15, u'love'),
 (18, u'computer'),
 (12, u'gensim'),
 (13, u'out')]

It looks like the query only picks up existing terms and uses the pre-calculated weights to give you the tf-idf score. So my workaround is to rebuild the model weekly or daily, since it is fast to do.
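
For completeness, the nearest-document lookup from the question can then be run against a pre-built similarity index. This is only a minimal sketch reusing the objects from the snippet above (dictionary, corpus_tfidf, vec_tfidf, dat); passing num_features explicitly tells the index the vocabulary size:

from gensim import similarities

# build the cosine-similarity index once over the existing corpus
index = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))

# query it with the new document's tf-idf vector
sims = index[vec_tfidf]          # one cosine similarity per stored document
best = sims.argmax()             # position of the most similar existing document
print(list(enumerate(sims)))
print(dat.iloc[best])            # tokens of the closest existing document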

answered Sep 30 '22 by snowneji


Let me share my thoughts.

One thing is the Corpus, another thing is the Model, and another thing is the Query. I would say it is sometimes easy to mix them up.

1) Corpus and Models

A Corpus is a set of documents, your library, where each document is represented in a certain format. For example, a Corpus_BOW represents your documents as a Bag of Words. A Corpus_TFIDF represents your documents by their TFIDF.

A Model is something that transforms one Corpus representation into another. For example, Model_TFIDF transforms Corpus_BOW --> Corpus_TFIDF. You can have other models, for example a model for Corpus_TFIDF --> Corpus_LSI or Corpus_BOW --> Corpus_LSI.

I would say this is the main nature of the wonderful Gensim: it is a corpus transformer. The objective is to find the corpus representation that best captures the similarities between documents for your application.

A couple of important ideas:

  • First, the Model is always built from the entry Corpus, for example: Model_TFIDF = models.TfidfModel(Corpus_BOW, id2word = yourDictionary)
  • Second, if you want your corpus in a given format (Corpus_TFIDF), you first need to build the model (Model_TFIDF) and then transform your entry corpus: Corpus_TFIDF = Model_TFIDF[Corpus_BOW].

So, we first build the model with the entry corpus, and then apply the model to the same entry corpus, to obtain the output corpus. Perhaps some steps could be joined, but these are the conceptual steps.
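
In gensim code, those two conceptual steps look roughly like this (just a sketch, reusing the tokenized Series dat from the first answer):

from gensim import corpora, models

dictionary = corpora.Dictionary(dat)                    # the vocabulary
Corpus_BOW = [dictionary.doc2bow(doc) for doc in dat]   # documents as bags of words

# 1) build the model from the entry corpus
Model_TFIDF = models.TfidfModel(Corpus_BOW, id2word=dictionary)

# 2) apply the model to the same entry corpus to obtain the output corpus
Corpus_TFIDF = Model_TFIDF[Corpus_BOW]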

2) Queries and Updates

A given model can be applied to new documents to obtain the new documents' TFIDF. For example, New_Corpus_TFIDF = Model_TFIDF[New_Corpus_BOW]. But this is just querying. The model is not updated with the new corpus/documents. That is, the model is built from the original corpus and used, as it is, with the new documents.

This is useful when the new document is just a short user query and we want to find the most similar documents in our original corpus. Or when we have just a single new document and we want to find the most similar ones in our corpus. In these cases, if your corpus is large enough, you don't need to update the model.

But let's say your library, your corpus, is something alive, and you want to update your models with new documents as if they had been there from the beginning. Some models can be updated just by giving them the new documents. For example, models.LsiModel has an "add_documents" method for including a new corpus in your LSI model (so if you built it with Corpus_BOW, you can just update it by giving it New_Corpus_BOW), as in the sketch below.
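
A rough sketch (assuming dictionary and Corpus_BOW from above, and new_docs being a list of tokenized new documents; note that doc2bow silently drops tokens missing from the original dictionary):

from gensim import models

lsi = models.LsiModel(Corpus_BOW, id2word=dictionary, num_topics=2)

New_Corpus_BOW = [dictionary.doc2bow(doc) for doc in new_docs]
lsi.add_documents(New_Corpus_BOW)   # updates the LSI projection incrementally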

But the TFIDF model does not have this "add_documents" method. I don't know if there is a complex and smart mathematical way to overcome this, but the fact is that the "IDF" part of TFIDF depends on the full corpus (previous and new). So, if you add a new document, the IDF of every previous document changes. The only way to update the TFIDF model is to recalculate it.

In any case, consider that even if you can update a model, you then need to apply it again to your entry corpus to obtain the output corpus, and rebuild the similarity index.

As mentioned before, if your library is large enough, you can use the original TFIDF model and apply it to new documents as it is, without updating the model. The results are probably good enough. Then, from time to time, when the number of new documents is large, you rebuild the TFIDF model.
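
That periodic rebuild could look roughly like this (a sketch with assumed names: dat is the old tokenized corpus from the first answer, new_docs the accumulated tokenized new documents):

import pandas as pd
from gensim import corpora, models, similarities

all_docs = pd.concat([dat, pd.Series(new_docs)], ignore_index=True)

dictionary = corpora.Dictionary(all_docs)                 # fresh vocabulary over old + new
corpus = [dictionary.doc2bow(doc) for doc in all_docs]
tfidf = models.TfidfModel(corpus)                         # IDF recomputed over everything
corpus_tfidf = tfidf[corpus]
index = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))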

answered Sep 30 '22 by rafael alonso