Finding the number of documents per topic for LDA with scikit-learn

Question

I'm following along with the scikit-learn LDA example here and am trying to understand how I can (if possible) surface how many documents have been labeled as having each one of these topics. I've been poring through the docs for the LDA model here but don't see where I could get this number. Has anyone been able to do this before with scikit-learn?

Patrizio G · Accepted Answer

LDA does not assign a single label to a document. It calculates a list of topic probabilities for each document, so you may want to interpret the topic of a document as the topic with the highest probability for that document.

If dtm is your document-term matrix and lda is your fitted Latent Dirichlet Allocation object, you can explore the topic mixtures with the transform() function and pandas:

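import pandas as pd

N_TOPICS = lda.n_components  # number of topics the model was fitted with
# Each row is a document, each column a topic probability
docsVStopics = lda.transform(dtm)
docsVStopics = pd.DataFrame(docsVStopics, columns=["Topic"+str(i+1) for i in range(N_TOPICS)])
print("Created a (%dx%d) document-topic matrix." % (docsVStopics.shape[0], docsVStopics.shape[1]))
docsVStopics.head()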
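For anyone who wants to try this end to end, here is a minimal sketch that builds dtm and lda from scratch. The toy corpus, the choice of two topics, and the random_state value are assumptions made for illustration only; they are not from the question or the linked example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, purely illustrative (an assumption, not the asker's data).
docs = [
    "cats and dogs are popular pets",
    "dogs chase cats around the garden",
    "stocks and bonds are financial assets",
    "investors trade stocks on the market",
]
N_TOPICS = 2  # assumed number of topics for this sketch

# Build the document-term matrix and fit the model.
dtm = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
lda.fit(dtm)

# transform() returns one row per document and one column per topic.
doc_topic = lda.transform(dtm)
docsVStopics = pd.DataFrame(
    doc_topic, columns=["Topic" + str(i + 1) for i in range(N_TOPICS)]
)
print(docsVStopics)
```

Each row of the resulting frame is the topic mixture of one document, so the rows should sum to approximately 1.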

You can easily find the most likely topic for each document:

most_likely_topics = docsVStopics.idxmax(axis=1)

Then get the counts:

most_likely_topics.groupby(most_likely_topics).count()
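One thing to keep in mind: counting by group only lists topics that are the most likely topic for at least one document, so a topic that never wins is missing from the result rather than shown as 0. If you need a count for every topic, a NumPy-based variant is one option. This is a sketch only; the doc_topic array below is a made-up stand-in for the output of lda.transform(dtm).

```python
import numpy as np

# Hypothetical document-topic matrix standing in for lda.transform(dtm):
# 4 documents (rows) and 3 topics (columns). The values are invented.
doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.4, 0.1],
])
n_topics = doc_topic.shape[1]

# Index of the highest-probability topic for each document.
most_likely = doc_topic.argmax(axis=1)

# minlength makes sure topics with no documents still get a 0 entry.
counts = np.bincount(most_likely, minlength=n_topics)

for i, c in enumerate(counts):
    print("Topic%d: %d" % (i + 1, c))
```

Here the third topic is never the top topic for any document, and it still appears in counts with a value of 0.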

Tags:

scikit-learn

lda

user139014
