I'm following the 'English Wikipedia' gensim tutorial at https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation
where it explains that tf-idf is used during training (at least for LSA; it is less clear for LDA).
I expected to apply a tf-idf transformer to new documents, but instead, at the end of the tutorial, it suggests simply passing in a bag-of-words vector:
doc_lda = lda[doc_bow]
Does LDA require bag-of-words vectors only?
The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus. Let's create them, as sketched below: gensim assigns a unique integer id to each word in the documents.
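A minimal sketch of that step (the tokenized documents here are just made-up examples):

from gensim import corpora

# hypothetical tokenized documents (one list of words per document)
docs = [['human', 'interface', 'computer'],
        ['survey', 'user', 'computer', 'system']]

id2word = corpora.Dictionary(docs)               # maps each unique word to an integer id
corpus = [id2word.doc2bow(doc) for doc in docs]  # each document becomes a list of (word_id, count) pairs
print(corpus[0])                                 # e.g. [(0, 1), (1, 1), (2, 1)]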
In general, you can save things with generic Python pickle, but most gensim models support their own native .save() method. It takes a target filesystem path and saves the model more efficiently than pickle, often by placing large component arrays in separate files alongside the main file.
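Using the id2word and corpus from the sketch above, saving and loading might look like this (the file name is made up; save() and load() are the standard gensim calls):

from gensim import models

lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2)
lda.save('lda_toy.model')                            # native save; large arrays may be written to separate files next to it
lda_loaded = models.LdaModel.load('lda_toy.model')   # restore the model later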
TL;DR: Yes, LDA only needs a bag-of-words vector.
Indeed, in the Wikipedia example of the gensim tutorial, Radim Rehurek uses the TF-IDF corpus generated in the preprocessing step.
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
I believe the reason for that is simply that this matrix is sparse and easy to handle (and already exists anyway due to the preprocessing step).
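If I remember the tutorial correctly, that matrix is then fed straight into the model, roughly like this (the file names are the ones produced by the tutorial's preprocessing step; adjust them to your setup):

import gensim

# artifacts written by the preprocessing step of the wiki tutorial
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')

lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100,
                                      update_every=1, chunksize=10000, passes=1)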
LDA does not necessarily need to be trained on a TF-IDF corpus. The model works just fine if you use the corpus shown in the gensim tutorial Corpora and Vector Spaces:
from gensim import corpora, models
texts = [['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1, chunksize=10000, passes=1)
Notice that corpus (built from texts with doc2bow) is a list of bag-of-words vectors. As you pointed out correctly, that is the centerpiece of the LDA model. TF-IDF does not play any role in it at all.
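And to come back to your doc_lda = lda[doc_bow] line: with the dictionary and lda from the snippet above, querying a new (made-up) document is just:

new_doc = ['human', 'computer', 'system']
doc_bow = dictionary.doc2bow(new_doc)   # plain bag-of-words, no tf-idf weighting
doc_lda = lda[doc_bow]                  # list of (topic_id, probability) pairs
print(doc_lda)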
In fact, Blei (who developed LDA) points out in the introduction of his 2003 paper "Latent Dirichlet Allocation" that LDA addresses the shortcomings of the TF-IDF model and leaves that approach behind. LSA is completely algebraic and generally (but not necessarily) uses a TF-IDF matrix, while LDA is a probabilistic model that tries to estimate probability distributions for topics in documents and for words in topics. The weighting of TF-IDF is not necessary for this.
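A minimal sketch of that difference in gensim (toy data, arbitrary parameters): LSI is typically applied on top of a tf-idf-weighted corpus, while LDA consumes the raw bag-of-words counts directly.

from gensim import corpora, models

texts = [['human', 'interface', 'computer'],
         ['graph', 'minors', 'trees'],
         ['graph', 'trees']]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# LSA/LSI: usually trained on a tf-idf weighted corpus
tfidf = models.TfidfModel(bow_corpus)
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=2)

# LDA: trained directly on the raw bag-of-words counts
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2)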
Not to disagree with Jérôme's answer, but tf-idf is used in latent Dirichlet allocation to some extent. As can be read in the paper Topic Models by Blei and Lafferty (e.g. p. 6, "Visualizing Topics", and p. 12), the tf-idf score can be very useful for LDA. It can be used to visualize topics or to choose the vocabulary: "It is often computationally expensive to use the entire vocabulary. Choosing the top V words by TFIDF is an effective way to prune the vocabulary."
That said, LDA does not need tf-idf to infer topics, but it can be useful and it can improve your results.
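One possible reading of that pruning suggestion, sketched in gensim (scoring each term by the highest tf-idf value it reaches in any document; V and the toy texts are made up):

from gensim import corpora, models

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system'],
         ['graph', 'minors', 'trees']]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

V = 5                                   # hypothetical vocabulary budget
tfidf = models.TfidfModel(bow_corpus)

# best tf-idf score each term reaches anywhere in the corpus
best_score = {}
for doc in tfidf[bow_corpus]:
    for term_id, score in doc:
        best_score[term_id] = max(score, best_score.get(term_id, 0.0))

top_ids = sorted(best_score, key=best_score.get, reverse=True)[:V]
dictionary.filter_tokens(good_ids=top_ids)              # keep only the top-V terms
pruned_corpus = [dictionary.doc2bow(t) for t in texts]  # rebuild the corpus on the pruned vocabulary

lda = models.LdaModel(pruned_corpus, id2word=dictionary, num_topics=2)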