I am now going through the LDA (Latent Dirichlet Allocation) topic modelling method to help extract topics from a set of documents. From what I have understood from the link below, this is an unsupervised learning approach that categorizes / labels each document with the extracted topics.
Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
In the sample code given in that link, there is a function defined to get the top words associated with each of the topics identified.
sklearn.__version__
Out[41]: '0.17'
from sklearn.decomposition import LatentDirichletAllocation
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
My question is this: is there any component or matrix of the fitted LDA model from which we can get the document-topic association?
For example, I need to find the top 2 topics associated with each document, to use as the label / category for that document. Is there a component that gives the distribution of topics within a document, similar to model.components_, which gives the distribution of words within a topic?
You can compute the document-topic association using the transform(X) method of the LDA class.
In the example code, this would be:
doc_topic_distrib = lda.transform(tf)
where lda is the fitted LDA model and tf is the input data you want to transform.
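If you then want the top 2 topics per document, you can sort each row of that distribution with numpy. Below is a minimal, self-contained sketch; the toy corpus, the number of topics, and the variable names are just placeholders for illustration (note the constructor parameter is n_topics in scikit-learn 0.17, renamed to n_components in later releases):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus, purely for illustration
docs = [
    "cats and dogs are popular pets",
    "dogs chase cats around the garden",
    "stock markets rise and fall every day",
    "investors watch the stock market closely",
]

tf_vectorizer = CountVectorizer(stop_words='english')
tf = tf_vectorizer.fit_transform(docs)

# n_topics in 0.17; use n_components with newer scikit-learn versions
lda = LatentDirichletAllocation(n_topics=2, random_state=0)
lda.fit(tf)

# One row per document, one column per topic
doc_topic_distrib = lda.transform(tf)

# Indices of the top 2 topics for each document, highest weight first
top2 = np.argsort(doc_topic_distrib, axis=1)[:, :-3:-1]

for doc_idx, topics in enumerate(top2):
    print("Doc #%d: topics %s" % (doc_idx, topics.tolist()))

This reuses the same [:-n-1:-1] slicing trick as print_top_words, just applied row-wise across the document-topic matrix instead of to a single topic's word weights.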