Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LDA Topic Model Performance - Topic Coherence Implementation for scikit-learn

I have a question around measuring/calculating topic coherence for LDA models built in scikit-learn.

Topic Coherence is a useful metric for measuring the human interpretability of a given LDA topic model. Gensim's CoherenceModel allows Topic Coherence to be calculated for a given LDA model (several variants are included).

I am interested in leveraging scikit-learn's LDA rather than gensim's LDA for ease of use and documentation (note: I would like to avoid using the gensim to scikit-learn wrapper i.e. actually leverage sklearn’s LDA). From my research, there is seemingly no scikit-learn equivalent to Gensim’s CoherenceModel.

Is there a way to either:

1 - Feed scikit-learn’s LDA model into gensim’s CoherenceModel pipeline, either through manually converting the scikit-learn model into gensim format or through a scikit-learn to gensim wrapper (I have seen the wrapper the other way around) to generate Topic Coherence?

Or

2 - Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?

I have done quite a bit of research on implementations for this use case online but haven’t seen any solutions. The only leads I have are the documented equations from scientific literature.

If anyone has any knowledge on any similar implementations, or if you could point me in the right direction for creating a manual method for this, that would be great. Thank you!

*Side note: I understand that perplexity and log-likelihood are available in scikit-learn for performance measurements, but these are not as predictive from what I have read.

like image 873
learning-new-things-guy Avatar asked Aug 30 '18 18:08

learning-new-things-guy


People also ask

What is coherence LDA?

What is topic coherence? Topic Coherence measures score a single topic by measuring the degree of semantic similarity between high scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable topics and topics that are artifacts of statistical inference.

What is LDA topic modeling?

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique to extract topics from a given corpus. The term latent conveys something that exists but is not yet developed. In other words, latent means hidden or concealed. Now, the topics that we want to extract from the data are also “hidden topics”.

What is a good topic coherence score?

There is no one way to determine whether the coherence score is good or bad. The score and its value depend on the data that it's calculated from. For instance, in one case, the score of 0.5 might be good enough but in another case not acceptable. The only rule is that we want to maximize this score.


1 Answers

Feed scikit-learn’s LDA model into gensim’s CoherenceModel pipeline

As far as I know, there is no "easy way" to do this. You would have to manually reformat the sklearn data structures to be compatible with gensim. I haven't attempted this myself, but this strikes me as an unnecessary step that might take a long time. There is an old Python 2.7 attempt at a gensim-sklearn-wrapper which you might want to look at, but it seems deprecated - maybe you can get some information/inspiration from that.

Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?

The summing-up of vectors you need can be easily achieved with a loop. You can find code samples for a "manual" coherence calculation for NMF. Calculation depends on the specific measure, of course, but sklearn should return you the data you need for the analysis pretty easily.

Resources

It is unclear to me why you would categorically exclude gensim - the topic coherence pipeline is pretty extensive, and documentation exists.

See, for example, these three tutorials (in Jupyter notebooks).

  • Demonstration of the topic coherence pipeline in Gensim
  • Performing Model Selection Using Topic Coherence
  • Benchmark testing of coherence pipeline on Movies dataset

The formulas for several coherence measures can be found in this paper here.

like image 77
jhl Avatar answered Sep 24 '22 11:09

jhl