
finding number of documents per topic for LDA with scikit-learn

I'm following along with the scikit-learn LDA example here and am trying to understand how I can (if possible) surface how many documents have been labeled as having each one of these topics. I've been poring through the docs for the LDA model here but don't see where I could get this number. Has anyone been able to do this before with scikit-learn?

user139014 asked Feb 07 '16 11:02


1 Answer

LDA does not assign a single label to a document; it computes a distribution of topic probabilities for each document. To get a count per topic, you may therefore want to treat the topic with the highest probability as that document's topic.
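To make the "highest probability wins" idea concrete, here is a minimal sketch in plain Python. The probabilities are made up for illustration and the names `doc_topic`, `dominant` and `docs_per_topic` are hypothetical; with a real model the rows would come from the fitted LDA object.

```python
from collections import Counter

# Hypothetical document-topic probabilities (rows: documents, columns: topics).
# These numbers are invented; real ones come from the fitted model.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.25, 0.15, 0.60],
    [0.55, 0.30, 0.15],
]

# For each document, take the index of its largest probability.
dominant = [max(range(len(row)), key=row.__getitem__) for row in doc_topic]

# Count how many documents landed on each topic index.
docs_per_topic = Counter(dominant)
print(dominant)
print(dict(docs_per_topic))
```

Note that this hard assignment throws away the mixture information: a document split 0.55 / 0.30 / 0.15 counts fully toward its top topic.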

If dtm is your document-term matrix and lda is your fitted LatentDirichletAllocation object, you can explore the topic mixtures with its transform() method and pandas:

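import pandas as pd

# Rows are documents, columns are topic probabilities.
docsVStopics = lda.transform(dtm)
N_TOPICS = docsVStopics.shape[1]  # number of topics the model was fitted with
docsVStopics = pd.DataFrame(docsVStopics, columns=["Topic" + str(i + 1) for i in range(N_TOPICS)])
print("Created a (%dx%d) document-topic matrix." % docsVStopics.shape)
docsVStopics.head()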

You can easily find the most likely topic for each document:

most_likely_topics = docsVStopics.idxmax(axis=1)

Then count how many documents fall under each topic:

most_likely_topics.groupby(most_likely_topics).count()
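
For reference, here is a self-contained sketch that puts these steps together. The corpus is made up and N_TOPICS = 2 is an assumed value for this toy example. It also assumes a recent scikit-learn release, where the number of topics is passed as n_components (older releases called it n_topics). It uses value_counts() with reindex() instead of the groupby above so that topics with no documents still show up with a count of 0.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, only for illustration.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors sold shares and bonds",
    "the dog chased the cat",
    "the market rallied on earnings",
]

N_TOPICS = 2  # assumed value for this toy example

# Build the document-term matrix and fit the model.
dtm = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
doc_topic = lda.fit_transform(dtm)  # shape: (n_documents, N_TOPICS)

docsVStopics = pd.DataFrame(
    doc_topic, columns=["Topic" + str(i + 1) for i in range(N_TOPICS)]
)

# Label each document with its highest-probability topic.
most_likely_topics = docsVStopics.idxmax(axis=1)

# Count documents per topic; reindex so empty topics appear with 0.
counts = most_likely_topics.value_counts().reindex(docsVStopics.columns, fill_value=0)
print(counts)
```

The counts always sum to the number of documents, since each document is assigned to exactly one topic.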
Patrizio G answered Sep 20 '22 16:09