 

Necessary to apply TF-IDF to new documents in gensim LDA model?

Tags:

gensim

I'm following the 'English Wikipedia' gensim tutorial at https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation

where it explains that tf-idf is used during training (at least for LSA; it's less clear for LDA).

I expected to apply a tf-idf transformer to new documents, but instead, at the end of the tutorial, it suggests simply passing in a bag-of-words vector:

doc_lda = lda[doc_bow]
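(For reference, doc_bow there is just the dictionary's bag-of-words encoding of the new document, i.e. something like the following, with new_tokens standing in for the tokenized new text:

doc_bow = dictionary.doc2bow(new_tokens)  # plain (token_id, count) pairs, no tf-idf
)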

Does LDA require bag-of-words vectors only?

asked Jun 27 '17 by Luke W



2 Answers

TL;DR: Yes, LDA only needs bag-of-words vectors.

Indeed, in the Wikipedia example of the gensim tutorial, Radim Rehurek uses the TF-IDF corpus generated in the preprocessing step.

import gensim
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')

I believe the reason is simply that this matrix is sparse, easy to handle, and already exists anyway as a by-product of the preprocessing step.

LDA does not necessarily need to be trained on a TF-IDF corpus. The model works just fine if you use the corpus shown in the gensim tutorial Corpora and Vector Spaces:

from gensim import corpora, models
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1, chunksize=10000, passes=1)

Notice that corpus is a list of plain bag-of-words vectors built from the tokenized texts. As you pointed out correctly, that is the centerpiece of the LDA model; TF-IDF does not play any role in it at all.
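To make this concrete, here is a minimal sketch of inference on an unseen document, reusing the dictionary and lda objects trained above (the example document itself is made up):

# a hypothetical new document, tokenized the same way as the training texts
new_doc = ['human', 'computer', 'interface', 'time']

# plain bag-of-words encoding: (token_id, raw count) pairs, no tf-idf weighting
new_bow = dictionary.doc2bow(new_doc)

# topic distribution for the unseen document
print(lda[new_bow])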

In fact, Blei (who developed LDA) points out in the introduction of his 2003 paper "Latent Dirichlet Allocation" that LDA addresses the shortcomings of the TF-IDF model and leaves that approach behind. LSA is completely algebraic and generally (but not necessarily) uses a TF-IDF matrix, while LDA is a probabilistic model that tries to estimate probability distributions for topics in documents and for words in topics. TF-IDF weighting is not necessary for this.
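For contrast, a typical LSI pipeline in gensim does insert the TF-IDF step, in the style of the tutorial's transformation chain (num_topics=2 is just an illustrative value):

# LSA/LSI: usually trained on the tf-idf-weighted corpus
tfidf = models.TfidfModel(corpus)    # fit tf-idf weights on the bow corpus
corpus_tfidf = tfidf[corpus]         # lazily apply the transformation
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)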

answered Sep 24 '22 by Jérôme Bau

Not to disagree with Jérôme's answer, but tf-idf is used in latent Dirichlet allocation to some extent. As can be read in the paper Topic Models by Blei and Lafferty (e.g. p. 6, "Visualizing topics", and p. 12), the tf-idf score can be very useful for LDA: it can be used to visualize topics or to choose the vocabulary. "It is often computationally expensive to use the entire vocabulary. Choosing the top V words by TFIDF is an effective way to prune the vocabulary".

This said, LDA does not need tf-idf to infer topics, but it can be useful and it can improve your results.
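A minimal sketch of that pruning idea, reusing texts from the first answer (the value of V and the max-score rule are illustrative choices of mine, not taken from the paper):

from gensim import corpora, models

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

# best tf-idf score each token achieves anywhere in the corpus
best = {}
for doc in tfidf[corpus]:
    for token_id, score in doc:
        best[token_id] = max(score, best.get(token_id, 0.0))

V = 5  # tiny top-V for this toy corpus
top_ids = sorted(best, key=best.get, reverse=True)[:V]
dictionary.filter_tokens(good_ids=top_ids)  # keep only the top-V tokens
dictionary.compactify()

# rebuild the corpus with the pruned vocabulary before training LDA
pruned_corpus = [dictionary.doc2bow(text) for text in texts]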

answered Sep 20 '22 by bbrinx