I used Latent Dirichlet Allocation (the scikit-learn implementation) to analyse about 500 scientific article abstracts, and I got topics containing the most important words (in German). My problem is interpreting the values associated with those words: I assumed I would get probabilities for all words per topic that add up to 1, which is not the case.
How can I interpret these values? For example, I would like to be able to tell why topic #20 has words with much higher values than the other topics. Is their absolute magnitude related to Bayesian probability? Is the topic more common in the corpus? I am not yet able to reconcile these values with the math behind LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words=stop_ger,
                                analyzer='word',
                                tokenizer=stemmer_sklearn.stem_ger())
tf = tf_vectorizer.fit_transform(texts)

n_topics = 10
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50., random_state=0)
lda.fit(tf)
def print_top_words(model, feature_names, n_top_words):
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic Nr.%d:' % int(topic_id + 1))
        print(''.join([feature_names[i] + ' ' + str(round(topic[i], 2))
                       + ' | ' for i in topic.argsort()[:-n_top_words - 1:-1]]))

n_top_words = 4
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
Topic Nr.1: demenzforsch 1.31 | fotus 1.21 | umwelteinfluss 1.16 | forschungsergebnis 1.04 |
Topic Nr.2: fur 1.47 | zwisch 0.94 | uber 0.81 | kontext 0.8 |
...
Topic Nr.20: werd 405.12 | fur 399.62 | sozial 212.31 | beitrag 177.95 |
LDA is typically evaluated by either measuring performance on some secondary task, such as document classification or information retrieval, or by estimating the probability of unseen held-out documents given some training documents.
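As a rough sketch of the second approach in scikit-learn, assuming tf is the document-term matrix from the question and tf_test is a hypothetical held-out matrix produced by the same tf_vectorizer on unseen abstracts, the fitted model offers score (an approximate log-likelihood bound) and perplexity:

# tf_test is assumed to come from the same tf_vectorizer, applied to unseen abstracts.
print('approx. log-likelihood bound:', lda.score(tf_test))
print('perplexity (lower is better):', lda.perplexity(tf_test))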
LDA (short for Latent Dirichlet Allocation) is an unsupervised machine-learning model that takes documents as input and finds topics as output. The model also tells you to what extent each document talks about each topic. A topic is represented as a weighted list of words.
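If you want those per-document topic proportions for the question's model, the fitted scikit-learn estimator has a transform method; a minimal sketch reusing lda and tf from above:

doc_topic = lda.transform(tf)     # shape: (n_documents, n_topics)
print(doc_topic[0])               # topic mixture of the first abstract
print(doc_topic[0].sum())         # rows are normalized, so this is ~1.0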
Number of topics: n_components is the number of topics to extract from the corpus. Maximum iterations: max_iter is the maximum number of iterations the LDA algorithm is allowed to run while converging.
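Note that recent scikit-learn versions call this parameter n_components (the question's n_topics is the older, deprecated name); a minimal sketch of the same configuration:

# Same model as in the question, written with the current parameter name.
lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50., random_state=0)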
From the documentation:
components_: Variational parameters for topic word distribution. Since the complete conditional for topic word distribution is a Dirichlet, components_[i, j] can be viewed as a pseudocount that represents the number of times word j was assigned to topic i. It can also be viewed as a distribution over the words for each topic after normalization:
model.components_ / model.components_.sum(axis=1)[:, np.newaxis]
So the values can be seen as a distribution if you normalize each row of components_, which lets you evaluate the importance of each term within a topic. As far as I understand, you cannot use the raw pseudo-counts to compare the importance of two topics across the corpus, because they also include a smoothing factor applied to the term-topic distribution.
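A minimal sketch of that normalization, reusing lda and tf_feature_names from the question (topic_word_dist is a new name introduced here):

import numpy as np

# Turn the pseudo-counts into a proper probability distribution per topic:
# each row of topic_word_dist sums to 1 over the vocabulary.
topic_word_dist = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]
print(topic_word_dist.sum(axis=1))   # all entries ~1.0

# Top words per topic, now shown as probabilities instead of pseudo-counts.
for topic_id, topic in enumerate(topic_word_dist):
    top = topic.argsort()[:-5:-1]    # indices of the 4 largest values
    print('Topic Nr.%d:' % (topic_id + 1),
          ' | '.join('%s %.4f' % (tf_feature_names[i], topic[i]) for i in top))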