Dynamic number of topics in topic models

I am new to topic modelling. My aim is to find the key topics in a document, and I am planning to use LDA for this. But in LDA the number of topics has to be defined in advance. I believe that if a document arrives from a domain that was not in the training corpus, the model will not give proper results. Is my reasoning correct, and is there an alternative approach?

Jishad AV asked Oct 17 '25 03:10



1 Answer

Two good candidates for learning the topics are Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP) topic models.

For LDA, the number of topics K is fixed and assumed to be known ahead of time. Fast inference algorithms, such as the online Variational Bayes (VB) algorithm implemented in scikit-learn and gensim, make it possible to train on very large data sets (e.g. New York Times or Wikipedia). Training on a large, diverse corpus with a generously high K reduces the risk of over-fitting to a narrow domain and helps the model produce meaningful topics for out-of-sample documents. To choose K, cross-validation is commonly used: evaluate held-out perplexity for several values of K and pick the one that minimizes it.
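As a rough illustration of picking K by held-out perplexity, here is a minimal sketch using scikit-learn's `LatentDirichletAllocation` with online VB. The toy corpus, the candidate values of K, and the helper name `pick_num_topics` are assumptions made up for this example; a real run would use a much larger corpus and probably several train/test splits.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical); replace with your own documents.
DOCS = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular household pets",
    "my dog chased the cat around the garden",
    "pets need food water and regular exercise",
    "the stock market fell sharply on monday",
    "investors sold shares as the market dropped",
    "the central bank raised interest rates again",
    "bond prices and stock prices moved together",
    "the team won the football match last night",
    "the striker scored two goals in the match",
    "fans cheered as the team lifted the trophy",
    "the coach praised the players after the game",
]
CANDIDATE_KS = [2, 3, 5]  # assumed search grid


def pick_num_topics(docs, candidates, seed=0):
    """Return (best_k, {k: held-out perplexity}) for the given candidates."""
    # Bag-of-words counts; the vocabulary is built from all documents.
    counts = CountVectorizer().fit_transform(docs)
    train, held_out = train_test_split(counts, test_size=0.25, random_state=seed)
    scores = {}
    for k in candidates:
        lda = LatentDirichletAllocation(
            n_components=k, learning_method="online", random_state=seed
        )
        lda.fit(train)
        # Lower perplexity on unseen documents is better.
        scores[k] = lda.perplexity(held_out)
    best_k = min(scores, key=scores.get)
    return best_k, scores


best_k, scores = pick_num_topics(DOCS, CANDIDATE_KS)
print("held-out perplexity per K:", scores)
print("selected K:", best_k)
```

On a corpus this small the selected K is not meaningful; the point is only the shape of the procedure.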

Alternatively, the HDP topic model (implemented in gensim) learns the number of topics from the data. You set only the concentration parameters and the truncation levels, which cap the number of topics, and the model infers how many topics are actually needed. Efficient inference algorithms, such as online variational inference for HDPs, make it possible to train on massive datasets and still discover meaningful topics.

Vadim Smolyakov answered Oct 19 '25 09:10
