I need to know whether a coherence score of 0.4 is good or bad. I use LDA as the topic modelling algorithm.
What is the average coherence score in this context?
There is no single threshold that determines whether a coherence score is good or bad. The score and its interpretation depend on the data it is calculated from. For instance, a score of 0.5 might be good enough in one case but not acceptable in another. The only general rule is that we want to maximize this score.
Coherence measures have been proposed in the NLP community to evaluate the topics constructed by a topic model. In a more general setting, coherence measures have been discussed in the philosophy of science as a formalism to quantify the hanging and fitting together of information pieces [3].
Coherence measures the relative distance between words within a topic. There are two major types: C_V, which typically falls in 0 < x < 1, and UMass, which falls in -14 < x < 14. It is rare to see a coherence of 1 or above 0.9 unless the words being measured are either identical words or bigrams: "United" and "States" would likely return a coherence score of ~0.94, while "hero" and "hero" would return a coherence of 1. The overall coherence score of a topic is the average of the distances between its words. I try to attain 0.7 in my LDAs when using C_V; I think that indicates a strong topic correlation. I would say:
0.3 is bad
0.4 is low
0.55 is okay
0.65 might be as good as it is going to get
0.7 is nice
0.8 is unlikely, and
0.9 is probably wrong
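For concreteness, here is a minimal sketch of computing both scores with gensim's CoherenceModel. The variable names are illustrative, and texts is assumed to be a list of tokenized documents:
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# texts: a list of tokenized documents, e.g. [["united", "states", ...], ...]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, random_state=123)

# C_V needs the raw tokenized texts; UMass only needs the bag-of-words corpus.
cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
umass = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence='u_mass')
print("C_V:", cv.get_coherence())       # typically 0 < x < 1
print("UMass:", umass.get_coherence())  # negative; closer to 0 is better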
Low coherence fixes:
adjust your parameters: alpha = .1, beta = .01 or .001, random_state = 123, etc. (see the parameter-tuning sketch after the code below)
get better data
at 0.4 you probably have the wrong number of topics. Check out https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/ for what is known as the elbow method: it gives you a graph of the optimal number of topics for greatest coherence in your data set. I'm using Mallet, which has pretty good coherence. Here is code to check coherence for different numbers of topics:
import gensim
from gensim.models import CoherenceModel

# Note: gensim.models.wrappers.LdaMallet requires gensim 3.x (the wrapper was
# removed in gensim 4.0) plus a local Mallet install, with mallet_path set to
# the Mallet binary.
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with the respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values
# Can take a long time to run.
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=40, step=6)

# Show graph
import matplotlib.pyplot as plt

limit = 40; start = 2; step = 6
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')  # legend labels must be a sequence, not a bare string
plt.show()
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

# Select the model and print the topics
from pprint import pprint

optimal_model = model_list[3]  # index 3 corresponds to num_topics=20 here (2, 8, 14, 20, ...)
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
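As for the parameter-tuning fix in the list above, here is a minimal sketch of passing alpha and beta to gensim's own LdaModel (eta is gensim's name for beta). The values shown are the illustrative ones from the list, not tuned recommendations, and corpus/id2word are assumed to exist as above:
from gensim.models import LdaModel

# alpha controls document-topic sparsity, eta (beta) controls topic-word
# sparsity, and random_state makes runs repeatable.
lda = LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=20,
    alpha=0.1,
    eta=0.01,  # or try 0.001
    random_state=123,
    passes=10,
)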
I hope this helps :)
In addition to the excellent answer from Sara:
UMass coherence measures how often two words (Wi, Wj) were seen together in the corpus. The pairwise score is defined as:
score(Wi, Wj) = log( (D(Wi, Wj) + EPSILON) / D(Wi) )
where:
D(Wi, Wj) is how many times words Wi and Wj appeared together,
D(Wi) is how many times word Wi appeared alone in the corpus, and
EPSILON is a small value (like 1e-12) added to the numerator to avoid zero values.
If Wi and Wj never appear together, this would result in log(0), which would break the computation; the EPSILON value is essentially a hack to fix this.
In conclusion, you can get a value anywhere from a very large negative number all the way up to approximately 0. The interpretation is the same as Sara wrote: the greater the number, the better, while a score of exactly 0 would obviously be wrong, since it would mean Wj co-occurs every single time Wi appears.
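A minimal sketch of that pairwise score on a toy corpus (the helper name and documents are illustrative, not from any library):
import math

EPSILON = 1e-12

def umass_score(docs, wi, wj):
    # D(Wi): number of documents containing Wi
    d_wi = sum(1 for doc in docs if wi in doc)
    # D(Wi, Wj): number of documents containing both Wi and Wj
    d_wi_wj = sum(1 for doc in docs if wi in doc and wj in doc)
    return math.log((d_wi_wj + EPSILON) / d_wi)

docs = [{"united", "states", "president"},
        {"united", "states", "congress"},
        {"united", "nations"}]
print(umass_score(docs, "united", "states"))     # log(2/3) ~ -0.41: frequent co-occurrence
print(umass_score(docs, "united", "president"))  # log(1/3) ~ -1.10: rarer co-occurrence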