I have a question about measuring/calculating topic coherence for LDA models built in scikit-learn.
Topic Coherence is a useful metric for measuring the human interpretability of a given LDA topic model. Gensim's CoherenceModel allows Topic Coherence to be calculated for a given LDA model (several variants are included).
I am interested in leveraging scikit-learn's LDA rather than gensim's LDA for ease of use and documentation (note: I would like to avoid using the gensim-to-scikit-learn wrapper, i.e. actually leverage sklearn's LDA). From my research, there is seemingly no scikit-learn equivalent to gensim's CoherenceModel.
Is there a way to either:
1 - Feed scikit-learn's LDA model into gensim's CoherenceModel pipeline, either by manually converting the scikit-learn model into gensim format or through a scikit-learn-to-gensim wrapper (I have only seen the wrapper in the other direction), to generate topic coherence?
Or
2 - Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?
I have done quite a bit of research online on implementations for this use case but haven't found any solutions. The only leads I have are the documented equations from the scientific literature.
If anyone has any knowledge on any similar implementations, or if you could point me in the right direction for creating a manual method for this, that would be great. Thank you!
*Side note: I understand that perplexity and log-likelihood are available in scikit-learn as performance measures, but from what I have read they are not as predictive of human interpretability.
What is topic coherence? A topic coherence measure scores a single topic by computing the degree of semantic similarity between the topic's high-scoring words. Such measures help distinguish topics that are semantically interpretable from topics that are mere artifacts of statistical inference.
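To make this concrete, one widely used measure is the UMass coherence of Mimno et al. (2011). For a topic's top-$N$ words $v_1, \dots, v_N$, ordered by probability, it is defined as

$$C_{\text{UMass}} = \sum_{m=2}^{N} \sum_{l=1}^{m-1} \log \frac{D(v_m, v_l) + 1}{D(v_l)}$$

where $D(v)$ is the number of documents containing the word $v$ and $D(v_m, v_l)$ is the number of documents containing both. Higher (less negative) values mean the topic's top words actually tend to co-occur in the corpus.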
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for extracting topics from a given corpus. The term latent means hidden or concealed: the topics we want to extract from the data are "hidden topics" that are never directly observed.
There is no single rule for deciding whether a coherence score is good or bad; its value depends on the data it is calculated from. For instance, a score of 0.5 might be good enough in one case but unacceptable in another. The only general rule is that we want to maximize the score.
Feed scikit-learn’s LDA model into gensim’s CoherenceModel pipeline
As far as I know, there is no "easy way" to do this. You would have to manually reformat the sklearn data structures to be compatible with gensim. I haven't attempted a full model conversion myself, and it strikes me as an unnecessary step that might take a long time. There is an old Python 2.7 attempt at a gensim-sklearn-wrapper which you might want to look at, but it seems deprecated; maybe you can get some information/inspiration from it.
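That said, you do not need to convert the model at all if all you want is a coherence score: gensim's CoherenceModel also accepts a plain list of topics (each a list of top words) via its topics argument. Below is a minimal sketch of this route; lda (a fitted sklearn LatentDirichletAllocation), vectorizer (the fitted CountVectorizer), and tokenized_docs (the same corpus as a list of token lists) are assumed names, not part of either library.

```python
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Build a gensim dictionary over the same corpus; the tokenization here must
# match the vectorizer's preprocessing, or CoherenceModel can hit
# out-of-vocabulary words.
dictionary = Dictionary(tokenized_docs)

# lda.components_ has shape (n_topics, n_words), in the vectorizer's
# vocabulary order; pull out the top-N words of each topic.
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() on older sklearn
top_n = 10
topics = [
    [feature_names[i] for i in topic.argsort()[:-top_n - 1:-1]]
    for topic in lda.components_
]

# Score the word lists directly; no gensim LDA model is required.
cm = CoherenceModel(topics=topics, texts=tokenized_docs,
                    dictionary=dictionary, coherence='c_v')
print(cm.get_coherence())
```

The same call works for the 'u_mass', 'c_uci' and 'c_npmi' variants; for 'u_mass', gensim converts texts to a bag-of-words corpus via the dictionary.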
Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?
The summing-up of vectors you need can easily be achieved with a loop, and code samples for a "manual" coherence calculation exist online for NMF; the same approach carries over to LDA. The calculation depends on the specific measure, of course, but sklearn readily gives you the data you need: the topic-word weights in components_ and the document-term matrix from CountVectorizer/TfidfVectorizer, as sketched below.
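As an illustration, here is a minimal sketch of the UMass measure from above, computed directly from the CountVectorizer output. The names lda and X (the sparse document-term matrix the model was fitted on) are assumptions, mirroring the previous snippet.

```python
import numpy as np

def umass_coherence(lda, X, top_n=10):
    """Per-topic UMass coherence for a fitted sklearn LDA model `lda`
    and the sparse document-term matrix `X` it was trained on."""
    binary = (X > 0).astype(int)  # 1 if a word occurs in a document at all
    scores = []
    for topic in lda.components_:
        top = topic.argsort()[:-top_n - 1:-1]  # top-N word indices, descending
        sub = binary[:, top]                   # docs x top_n occurrence matrix
        # Co-document frequencies D(v_m, v_l); the diagonal holds D(v).
        co_doc = (sub.T @ sub).toarray()
        score = 0.0
        for m in range(1, top_n):
            for l in range(m):
                # log((D(v_m, v_l) + 1) / D(v_l)), per Mimno et al. (2011)
                score += np.log((co_doc[m, l] + 1.0) / co_doc[l, l])
        scores.append(score)
    return scores
```

Calling umass_coherence(lda, X) returns one score per topic; averaging them gives a single model-level number. Note that gensim's coherence='u_mass' averages over word pairs rather than summing, so the two will differ by a normalization factor.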
Resources
It is unclear to me why you would categorically exclude gensim: its topic coherence pipeline is pretty extensive, and good documentation exists.
See, for example, these three tutorials (in Jupyter notebooks).
The formulas for several coherence measures can be found in this paper.