I'm new to Python and I need to build an LDA project. After some preprocessing steps, here is my code:
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# docs is the preprocessed input: a list of token lists, one per document
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # do not estimate perplexity during training

temp = dictionary[0]  # accessing an item forces the lazy id2token mapping to load
id2word = dictionary.id2token
model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize,
                 alpha='auto', eta='auto',
                 random_state=42,
                 iterations=iterations, num_topics=num_topics,
                 passes=passes, eval_every=eval_every)
I want the topic distribution for every document, i.e. the probabilities of all 10 topics for each one, but when I use:
get_document_topics = model.get_document_topics(corpus)
print(get_document_topics)
the output is only:
<gensim.interfaces.TransformedCorpus object at 0x000001DF28708E10>
How do I get the topic distribution of each document?
To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents.
The passes parameter is the number of times training goes through the entire corpus. Below are a few examples of different combinations of the 3 parameters and the number of online training updates which will occur while training LDA.
Perplexity is an intrinsic evaluation metric that is widely used for language model evaluation. It captures how surprised a model is by new data it has not seen before, and is computed from the normalized log-likelihood of a held-out test set (lower perplexity means a better fit).
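To make the relation between perplexity and the normalized log-likelihood concrete, here is a minimal sketch. It assumes you already have the natural-log probability the model assigns to each held-out token; the function name `perplexity` and the toy input are made up for illustration and are not part of gensim.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of a held-out set.

    Computed as exp(-(1/N) * sum(log p)), i.e. the exponential of the
    negative average log-likelihood per token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_log_likelihood = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-avg_log_likelihood)

# Toy case (assumed data): a model that gives every held-out token probability 1/4
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

A model that is no better than guessing uniformly among 4 options has a perplexity of about 4, which is why a lower value indicates a better fit.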
LdaMallet uses an optimized Gibbs sampling algorithm for Latent Dirichlet Allocation [2], which is why its parameters differ. However, most of the parameters (e.g., the number of topics, alpha, and beta/eta) are shared between both algorithms because both implement LDA.
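The function get_document_topics accepts either a single document in BOW format or a whole corpus. You're calling it on the full corpus (a list of documents), so instead of the scores themselves it returns a lazy iterable (the TransformedCorpus object you see) that yields the topic scores for each document as you iterate over or index it.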
You have a few options. If you just want one document, run it on the document you want the values for:
get_document_topics = model.get_document_topics(corpus[0])
Or do the following to get a list of scores for all the documents:
get_document_topics = [model.get_document_topics(item) for item in corpus]
Or directly access each object from your original code:
get_document_topics = model.get_document_topics(corpus)
print(get_document_topics[0])
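One more thing to be aware of if you need all 10 probabilities per document: each result is a sparse list of (topic_id, probability) pairs, and topics whose probability falls below the model's minimum_probability threshold are left out. A minimal sketch of filling those gaps with zeros follows; the helper name `to_dense` and the sample pairs are made up for illustration.

```python
def to_dense(doc_topics, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list.

    Topics missing from doc_topics are given probability 0.0.
    """
    dense = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        dense[topic_id] = float(prob)
    return dense

# Hypothetical sparse result for one document from a 10-topic model
sparse = [(1, 0.62), (4, 0.30), (7, 0.08)]
row = to_dense(sparse, num_topics=10)
print(row)
```

With the model from the question, something like `[to_dense(model.get_document_topics(bow, minimum_probability=0.0), num_topics) for bow in corpus]` should give one row of 10 values per document. Passing `minimum_probability=0.0` asks gensim to keep low-probability topics as well, so the zero filling only acts as a safety net.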