 

Necessary to apply TF-IDF to new documents in gensim LDA model?

Tags:

gensim

I'm following the 'English Wikipedia' gensim tutorial at https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation

where it explains that tf-idf is used during training (at least for LSA; it's less clear for LDA).

I expected to apply a tf-idf transformer to new documents, but instead, at the end of the tutorial, it suggests simply passing in a bag-of-words vector:

doc_lda = lda[doc_bow]
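(For reference, doc_bow there is just the dictionary's bag-of-words encoding of the new document, i.e. something like the following, with new_tokens standing in for the tokenized new text:

doc_bow = dictionary.doc2bow(new_tokens)  # plain (token_id, count) pairs, no tf-idf
)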

Does LDA require bag-of-words vectors only?

asked Jun 27 '17 by Luke W



2 Answers

TL;DR: Yes, LDA only needs bag-of-words vectors.

Indeed, in the Wikipedia example of the gensim tutorial, Radim Rehurek uses the TF-IDF corpus generated in the preprocessing step.

import gensim
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')

I believe the reason is simply that this matrix is sparse, easy to handle, and already exists anyway as a by-product of the preprocessing step.

LDA does not necessarily need to be trained on a TF-IDF corpus. The model works just fine if you use the corpus shown in the gensim tutorial Corpora and Vector Spaces:

from gensim import corpora, models
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1, chunksize=10000, passes=1)

Notice that corpus is a list of plain bag-of-words vectors built from the tokenized texts. As you pointed out correctly, that is the centerpiece of the LDA model; TF-IDF does not play any role in it at all.
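To make this concrete, here is a minimal sketch of inference on an unseen document, reusing the dictionary and lda objects trained above (the example document itself is made up):

# a hypothetical new document, tokenized the same way as the training texts
new_doc = ['human', 'computer', 'interface', 'time']

# plain bag-of-words encoding: (token_id, raw count) pairs, no tf-idf weighting
new_bow = dictionary.doc2bow(new_doc)

# topic distribution for the unseen document
print(lda[new_bow])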

In fact, Blei (who developed LDA) points out in the introduction of his 2003 paper "Latent Dirichlet Allocation" that LDA addresses the shortcomings of the TF-IDF model and leaves that approach behind. LSA is completely algebraic and generally (but not necessarily) uses a TF-IDF matrix, while LDA is a probabilistic model that tries to estimate probability distributions for topics in documents and for words in topics. TF-IDF weighting is not necessary for this.
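For contrast, a typical LSI pipeline in gensim does insert the TF-IDF step, in the style of the tutorial's transformation chain (num_topics=2 is just an illustrative value):

# LSA/LSI: usually trained on the tf-idf-weighted corpus
tfidf = models.TfidfModel(corpus)    # fit tf-idf weights on the bow corpus
corpus_tfidf = tfidf[corpus]         # lazily apply the transformation
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)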

answered Sep 24 '22 by Jérôme Bau

Not to disagree with Jérôme's answer, but tf-idf is used in latent Dirichlet allocation to some extent. As can be read in the paper Topic Models by Blei and Lafferty (e.g. p. 6, "Visualizing topics", and p. 12), the tf-idf score can be very useful for LDA: it can be used to visualize topics or to choose the vocabulary. "It is often computationally expensive to use the entire vocabulary. Choosing the top V words by TFIDF is an effective way to prune the vocabulary".

This said, LDA does not need tf-idf to infer topics, but it can be useful and it can improve your results.
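A minimal sketch of that pruning idea, reusing texts from the first answer (the value of V and the max-score rule are illustrative choices of mine, not taken from the paper):

from gensim import corpora, models

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

# best tf-idf score each token achieves anywhere in the corpus
best = {}
for doc in tfidf[corpus]:
    for token_id, score in doc:
        best[token_id] = max(score, best.get(token_id, 0.0))

V = 5  # tiny top-V for this toy corpus
top_ids = sorted(best, key=best.get, reverse=True)[:V]
dictionary.filter_tokens(good_ids=top_ids)  # keep only the top-V tokens
dictionary.compactify()

# rebuild the corpus with the pruned vocabulary before training LDA
pruned_corpus = [dictionary.doc2bow(text) for text in texts]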

answered Sep 20 '22 by bbrinx