
How to get document_topics distribution of all of the document in gensim LDA?

I'm new to Python and I need to build an LDA project. After some preprocessing steps, here is my code:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None

temp = dictionary[0]  # access an item so that id2token gets populated
id2word = dictionary.id2token

model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize,
                 alpha='auto', eta='auto',
                 random_state=42,
                 iterations=iterations, num_topics=num_topics,
                 passes=passes, eval_every=eval_every)

I want to get the topic distribution for every document in the corpus, i.e. 10 topic probabilities per document, but when I use:

get_document_topics = model.get_document_topics(corpus)
print(get_document_topics)

the output is only:

<gensim.interfaces.TransformedCorpus object at 0x000001DF28708E10>

How do I get the topic distribution of the docs?

asked Nov 15 '18 by wayne64001


People also ask

How do I know how many topics in LDA?

To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents.

What is passes in LDA Gensim?

passes is the number of times you go through the entire corpus during training. Together with chunksize and the corpus size, it determines how many online training updates occur while training the LDA model.
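As a rough sketch of that relationship (my own arithmetic, not a gensim API): with online training, gensim processes the corpus one chunk at a time and performs one update per chunk, repeated for each pass.

```python
import math

def num_updates(corpus_size, chunksize, passes):
    # one update per chunk, repeated once per pass (assumes update_every=1)
    chunks_per_pass = math.ceil(corpus_size / chunksize)
    return passes * chunks_per_pass

print(num_updates(corpus_size=2000, chunksize=2000, passes=20))   # 20 updates
print(num_updates(corpus_size=10000, chunksize=2000, passes=20))  # 100 updates
```

With the question's settings (2000 documents or fewer, chunksize=2000), each pass is a single chunk, so passes=20 means 20 updates.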

What is perplexity LDA?

Perplexity is an intrinsic evaluation metric, widely used for language model evaluation. It captures how surprised a model is by new data it has not seen before, and is computed from the normalized log-likelihood of a held-out test set.

What is LdaMallet?

LdaMallet uses an optimized Gibbs sampling algorithm for Latent Dirichlet Allocation [2]. This is the reason for the different parameters. However, most of the parameters (e.g., the number of topics, alpha, and beta) are shared between both implementations because both implement LDA.


1 Answer

The function get_document_topics expects a single document in bag-of-words format. You're calling it on the full corpus (a list of documents), so it returns a lazy TransformedCorpus object that yields the scores for each document as you iterate over it.

You have a few options. If you just want one document, run it on the document you want the values for:

get_document_topics = model.get_document_topics(corpus[0])

or do the following to get an array of scores for all the documents:

get_document_topics = [model.get_document_topics(item) for item in corpus]

Or directly access each object from your original code:

get_document_topics = model.get_document_topics(corpus)
print(get_document_topics[0])
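One more thing to watch out for if you want all 10 probabilities per document: by default get_document_topics drops topics whose probability falls below its minimum_probability threshold, so a row may contain fewer than num_topics pairs. You can either pass minimum_probability=0.0 to the call, or densify the sparse (topic_id, probability) pairs yourself, as in this small sketch:

```python
def to_dense(sparse_topics, num_topics):
    # expand [(topic_id, prob), ...] into a fixed-length probability vector
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense

# e.g. a document scored as [(1, 0.7), (4, 0.3)] with num_topics=10:
print(to_dense([(1, 0.7), (4, 0.3)], 10))
# [0.0, 0.7, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Applied to every row of the list comprehension above, this gives you a dense num_docs × num_topics matrix of topic distributions.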
answered Oct 28 '22 by Andrew McDowell