
How to interpret LDA components (using sklearn)?

I used Latent Dirichlet Allocation (the sklearn implementation) to analyse about 500 scientific article abstracts (in German) and got topics containing the most important words per topic. My problem is interpreting the values associated with those words. I assumed I would get probabilities for all words per topic that add up to 1, but that is not the case.

How can I interpret these values? For example, I would like to be able to tell why topic #20 has words with much higher values than the other topics. Does their absolute magnitude have something to do with Bayesian probability? Is that topic simply more common in the corpus? I am not yet able to reconcile these values with the math behind LDA.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# stop_ger: German stop-word list; stemmer_sklearn.stem_ger(): custom German stemmer/tokenizer
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words=stop_ger,
                                analyzer='word',
                                tokenizer=stemmer_sklearn.stem_ger())

tf = tf_vectorizer.fit_transform(texts)

n_topics = 10
# note: n_topics was renamed to n_components in later sklearn releases
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50., random_state=0)

lda.fit(tf)

def print_top_words(model, feature_names, n_top_words):
    # model.components_ has shape (n_topics, n_words); argsort picks the
    # n_top_words highest-weighted terms for each topic
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic Nr.%d:' % int(topic_id + 1))
        print(''.join([feature_names[i] + ' ' + str(round(topic[i], 2))
              + ' | ' for i in topic.argsort()[:-n_top_words - 1:-1]]))

n_top_words = 4
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Topic Nr.1: demenzforsch 1.31 | fotus 1.21 | umwelteinfluss 1.16 | forschungsergebnis 1.04 |
Topic Nr.2: fur 1.47 | zwisch 0.94 | uber 0.81 | kontext 0.8 |
...
Topic Nr.20: werd 405.12 | fur 399.62 | sozial 212.31 | beitrag 177.95 | 
asked Feb 01 '16 by LSz

People also ask

How do you evaluate LDA?

LDA is typically evaluated either by measuring performance on some secondary task, such as document classification or information retrieval, or by estimating the probability of unseen held-out documents given some training documents.
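As a rough sketch of the held-out evaluation idea with the model from the question: sklearn's LatentDirichletAllocation exposes perplexity and score methods. The train/test split and variable names below are assumptions for illustration; lower perplexity means a better fit.

    from sklearn.model_selection import train_test_split

    # hypothetical split of the document-term matrix from the question
    tf_train, tf_test = train_test_split(tf, test_size=0.2, random_state=0)

    lda.fit(tf_train)
    # perplexity of unseen documents; lower values indicate a better fit
    print(lda.perplexity(tf_test))
    # approximate log-likelihood of the held-out documents
    print(lda.score(tf_test))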

What is the output of LDA model?

LDA (short for Latent Dirichlet Allocation) is an unsupervised machine-learning model that takes documents as input and finds topics as output. The model also reports in what proportion each document talks about each topic. A topic is represented as a weighted list of words.
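With the fitted model from the question, those per-document topic proportions come from transform. A minimal sketch (the variable name doc_topic is my own):

    # rows: documents, columns: topics; each row is a topic distribution
    doc_topic = lda.transform(tf)

    # topic shares of the first document, e.g. "document 0 is 73% topic 3"
    for topic_id, share in enumerate(doc_topic[0]):
        print('Topic %d: %.1f%%' % (topic_id + 1, share * 100))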

What is Max_iter in LDA?

Number of topics: n_components is the number of topics to find in the corpus. Maximum iterations: max_iter is the maximum number of iterations the LDA algorithm is allowed to run while converging.


1 Answer

From the documentation

components_ Variational parameters for topic word distribution. Since the complete conditional for topic word distribution is a Dirichlet, components_[i, j] can be viewed as pseudocount that represents the number of times word j was assigned to topic i. It can also be viewed as distribution over the words for each topic after normalization: model.components_ / model.components_.sum(axis=1)[:, np.newaxis].

So the values can be seen as a distribution if you normalize each row of components_, which lets you evaluate the relative importance of each term within a topic. As far as I understand, you cannot use the raw pseudo-counts to compare the importance of two topics in the corpus, because they also include a smoothing factor applied to the term-topic distribution.
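A minimal sketch of that normalization, assuming the lda model from the question has already been fitted:

    import numpy as np

    # normalize each topic's pseudo-counts so the row sums to 1,
    # giving a proper word distribution per topic
    topic_word_dist = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]

    print(topic_word_dist.sum(axis=1))   # every row now sums to 1.0
    print(topic_word_dist.shape)         # (n_topics, n_words)

After this normalization the large absolute numbers of topic #20 disappear: every row sums to 1, and only the relative weight of each word within its own topic remains.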

answered Oct 04 '22 by Simon Thordal