I'm following along with the scikit-learn LDA example here and am trying to understand how I can (if possible) surface how many documents have been labeled as having each one of these topics. I've been poring through the docs for the LDA model here but don't see where I could get this number. Has anyone been able to do this before with scikit-learn?
LDA calculates a list of topic probabilities for each document, so you may want to interpret the topic of a document as the topic with the highest probability for that document.
If dtm is your document-term matrix and lda your Latent Dirichlet Allocation object, you can explore the topic mixtures with the transform() function and pandas:
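import pandas as pd
# Rows are documents, columns are topic probabilities
docsVStopics = lda.transform(dtm)
N_TOPICS = docsVStopics.shape[1]  # one column per topic, same as lda.n_components
docsVStopics = pd.DataFrame(docsVStopics, columns=["Topic" + str(i + 1) for i in range(N_TOPICS)])
print("Created a (%dx%d) document-topic matrix." % (docsVStopics.shape[0], docsVStopics.shape[1]))
docsVStopics.head()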
You can easily find the most likely topic for each document:
most_likely_topics = docsVStopics.idxmax(axis=1)
Then count how many documents fall under each topic:
most_likely_topics.groupby(most_likely_topics).count()
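To try the steps above end to end, here is a minimal sketch. The toy corpus, the choice of two topics, and the names docs, n_topics, doc_topics and topic_counts are made up for illustration and are not from the scikit-learn example. It uses value_counts() plus reindex() instead of groupby() so that topics with no documents still show up with a count of 0.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
    "the dog chased the cat",
    "bond markets rallied today",
]

n_topics = 2  # assumed number of topics

# Build the document-term matrix
dtm = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit LDA and get the per-document topic probabilities in one step
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topics = pd.DataFrame(
    lda.fit_transform(dtm),
    columns=["Topic" + str(i + 1) for i in range(n_topics)],
)

# Most likely topic per document, then the number of documents per topic
most_likely_topics = doc_topics.idxmax(axis=1)
topic_counts = most_likely_topics.value_counts().reindex(
    doc_topics.columns, fill_value=0
)
print(topic_counts)
```

Keep in mind that this assigns every document to exactly one topic, even when its probabilities are close to evenly split, so the counts are only a rough summary of the topic mixtures.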