Mallet topic modeling - topic keys output parameter

In MALLET topic modelling, the --output-topic-keys [FILENAME] option writes, beside each topic, a parameter that the tutorial on the MALLET site calls the "Dirichlet parameter" of the topic.

What does this parameter represent? Is it β in the LDA model? If not, what is it, and what are its meaning and use?

I noticed that when I don't use the parameter optimization option while generating the topic model, this parameter differs between version 2.0.7 and version 2.0.8. I want to know why this difference happens.

Here's the version 2.0.7 output:

[image: Version 2.0.7 output]

and the 2.0.8 output:

[image: Version 2.0.8 output]

I know that the output differs from run to run, but I am only concerned with this parameter.

Mahmoud Yusuf asked Jan 30 '23 23:01


1 Answer

The topic model inference algorithm used in Mallet involves repeatedly sampling a new topic assignment for each word, holding the assignments of all other words fixed. The factors that control this process are (1) how often the current word type appears in each topic and (2) how many times each topic appears in the current document. The smoothing parameters ensure that these values are never zero for any topic: beta for the first factor, alpha for the second.
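As a rough illustration of how those two factors and their smoothing terms combine, here is a minimal sketch of the standard collapsed Gibbs sampling conditional for LDA. It is not Mallet's actual code (Mallet uses an optimized sampler); the function names, argument names, and toy counts are all assumptions made for this example.

```python
import random

def topic_weights(word_topic_counts, topic_totals, doc_topic_counts,
                  alpha, beta, vocab_size):
    """Unnormalized sampling weight of each topic for one word token.

    All counts must exclude the token currently being resampled.
    """
    weights = []
    for k in range(len(alpha)):
        # Factor 1: how often this word type appears in topic k, smoothed by beta.
        word_factor = (word_topic_counts[k] + beta) / (topic_totals[k] + vocab_size * beta)
        # Factor 2: how often topic k appears in this document, smoothed by alpha[k].
        doc_factor = doc_topic_counts[k] + alpha[k]
        weights.append(word_factor * doc_factor)
    return weights

def sample_topic(weights, rng=random):
    """Draw a topic index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Toy counts (invented): the word has never been seen in topic 2 and
# topic 2 does not occur in the document, yet its weight stays above zero.
weights = topic_weights(
    word_topic_counts=[4, 1, 0],
    topic_totals=[100, 80, 60],
    doc_topic_counts=[7, 3, 0],
    alpha=[0.25, 0.25, 0.25],
    beta=0.01,
    vocab_size=1000,
)
new_topic = sample_topic(weights)
```

Without alpha and beta, topic 2 would have a weight of exactly zero here and could never be sampled for this word in this document.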

You can think of the alpha parameter being displayed here as the number of "imaginary" words from each topic that are added to every document. In the first case, topic 0 has 2.5 imaginary words of weight in every document. The default value for this parameter was initially 50 / numTopics. Larger values encourage models to have more uniform topic distributions in documents, smaller values encourage more sparsity. The general experience was that 50 was too large, and that 5 is a better default, so the default was changed in 2.0.8.
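To make the arithmetic concrete: the per-topic value shown in the topic keys file is the alpha sum divided by the number of topics. The sketch below assumes 20 topics, which is inferred from the 2.5 figure above (50 / 20 = 2.5) rather than stated in the question; the function name is hypothetical.

```python
def per_topic_alpha(alpha_sum, num_topics):
    """Symmetric prior: every topic gets an equal share of the alpha sum."""
    return alpha_sum / num_topics

num_topics = 20  # assumption, inferred from 50 / numTopics = 2.5

old_default = per_topic_alpha(50.0, num_topics)  # alpha sum before 2.0.8
new_default = per_topic_alpha(5.0, num_topics)   # alpha sum from 2.0.8 on

print(old_default)  # 2.5
print(new_default)  # 0.25
```

Under that assumption, the same unoptimized model would report 2.5 per topic in 2.0.7 and 0.25 per topic in 2.0.8.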

The default is to make the alpha weight equal for all topics. With hyperparameter optimization on, these values can vary. Usually what you will find is that a topic with a large value will contain "near stopwords" that are frequent in most documents and don't have much content. Topics with very small values often correspond to unusual and distinctive documents. Topics in the middle are often the most interesting.
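One way to act on this is to rank the topics in the keys file by their alpha value. The sketch below assumes each line of the file has the form topic id, tab, alpha, tab, space-separated top words; check your own output before relying on it. The sample lines are invented for illustration and are not real Mallet output, and the function name is hypothetical.

```python
def parse_topic_keys(text):
    """Parse topic-keys text, assuming 'id<TAB>alpha<TAB>words' on each line."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, alpha, words = line.split("\t", 2)
        topics.append((int(topic_id), float(alpha), words.split()))
    return topics

# Invented sample in the assumed format; not real Mallet output.
sample = (
    "0\t0.02\tzebra quartz fjord\n"
    "1\t1.75\talso one would\n"
    "2\t0.31\tmodel topic word\n"
)

# Sort from largest to smallest alpha: likely "near stopword" topics come first,
# the rarest and most distinctive ones last.
ranked = sorted(parse_topic_keys(sample), key=lambda t: t[1], reverse=True)
for topic_id, alpha, words in ranked:
    print(f"{topic_id}\t{alpha:.2f}\t{' '.join(words)}")
```

To use it on a real run, read the file written by --output-topic-keys into a string and pass that to parse_topic_keys instead of the sample.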

David Mimno answered Mar 23 '23 04:03