I used Latent Dirichlet Allocation (the scikit-learn implementation) to analyse about 500 scientific article abstracts, and I got topics containing the most important words (in German). My problem is interpreting the values associated with those words: I assumed I would get probabilities for all words per topic that add up to 1, which is not the case.
How can I interpret these values? For example, I would like to be able to tell why topic #20 has words with much higher values than the other topics. Is their absolute magnitude related to Bayesian probability? Is the topic more common in the corpus? I am not yet able to reconcile these values with the math behind LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words=stop_ger,
                                analyzer='word',
                                tokenizer=stemmer_sklearn.stem_ger())
tf = tf_vectorizer.fit_transform(texts)

n_topics = 10
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50., random_state=0)
lda.fit(tf)
def print_top_words(model, feature_names, n_top_words):
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic Nr.%d:' % int(topic_id + 1))
        print(''.join([feature_names[i] + ' ' + str(round(topic[i], 2))
                       + ' | ' for i in topic.argsort()[:-n_top_words - 1:-1]]))

n_top_words = 4
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
Topic Nr.1: demenzforsch 1.31 | fotus 1.21 | umwelteinfluss 1.16 | forschungsergebnis 1.04 |
Topic Nr.2: fur 1.47 | zwisch 0.94 | uber 0.81 | kontext 0.8 |
...
Topic Nr.20: werd 405.12 | fur 399.62 | sozial 212.31 | beitrag 177.95 |
LDA is typically evaluated by either measuring performance on some secondary task, such as document classification or information retrieval, or by estimating the probability of unseen held-out documents given some training documents.
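As a rough sketch of the second approach in scikit-learn, assuming tf is the document-term matrix from the question and tf_test is a hypothetical held-out matrix produced by the same tf_vectorizer on unseen abstracts, the fitted model offers score (an approximate log-likelihood bound) and perplexity:

# tf_test is assumed to come from the same tf_vectorizer, applied to unseen abstracts.
print('approx. log-likelihood bound:', lda.score(tf_test))
print('perplexity (lower is better):', lda.perplexity(tf_test))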
LDA (short for Latent Dirichlet Allocation) is an unsupervised machine-learning model that takes documents as input and finds topics as output. The model also tells you to what extent each document talks about each topic. A topic is represented as a weighted list of words.
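If you want those per-document topic proportions for the question's model, the fitted scikit-learn estimator has a transform method; a minimal sketch reusing lda and tf from above:

doc_topic = lda.transform(tf)     # shape: (n_documents, n_topics)
print(doc_topic[0])               # topic mixture of the first abstract
print(doc_topic[0].sum())         # rows are normalized, so this is ~1.0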
Number of topics: n_components is the number of topics to extract from the corpus. Maximum iterations: max_iter is the maximum number of iterations the LDA algorithm is allowed to run while converging.
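Note that recent scikit-learn versions call this parameter n_components (the question's n_topics is the older, deprecated name); a minimal sketch of the same configuration:

# Same model as in the question, written with the current parameter name.
lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50., random_state=0)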
From the documentation:
components_: Variational parameters for topic word distribution. Since the complete conditional for topic word distribution is a Dirichlet, components_[i, j] can be viewed as a pseudocount that represents the number of times word j was assigned to topic i. It can also be viewed as a distribution over the words for each topic after normalization:
model.components_ / model.components_.sum(axis=1)[:, np.newaxis]
So the values can be seen as a distribution if you normalize each row of components_, which lets you evaluate the importance of each term within a topic. As far as I understand, you cannot use the raw pseudo-counts to compare the importance of two topics across the corpus, because they also include a smoothing factor applied to the term-topic distribution.
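A minimal sketch of that normalization, reusing lda and tf_feature_names from the question (topic_word_dist is a new name introduced here):

import numpy as np

# Turn the pseudo-counts into a proper probability distribution per topic:
# each row of topic_word_dist sums to 1 over the vocabulary.
topic_word_dist = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]
print(topic_word_dist.sum(axis=1))   # all entries ~1.0

# Top words per topic, now shown as probabilities instead of pseudo-counts.
for topic_id, topic in enumerate(topic_word_dist):
    top = topic.argsort()[:-5:-1]    # indices of the 4 largest values
    print('Topic Nr.%d:' % (topic_id + 1),
          ' | '.join('%s %.4f' % (tf_feature_names[i], topic[i]) for i in top))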