
Evaluation of topic modeling: How to understand a coherence value / c_v of 0.4, is it good or bad? [closed]

I need to know whether a coherence score of 0.4 is good or bad. I use LDA as the topic modelling algorithm.

What is the average coherence score in this context?

Mohamed asked Feb 19 '19 09:02



People also ask

What is a good topic coherence score?

There is no one way to determine whether the coherence score is good or bad. The score and its value depend on the data that it's calculated from. For instance, in one case, the score of 0.5 might be good enough but in another case not acceptable. The only rule is that we want to maximize this score.

What is coherence measure?

Coherence measures have been proposed in the NLP community to evaluate topics constructed by some topic model. In a more general setting, coherence measures have been discussed in scientific philosophy as a formalism to quantify the hanging and fitting together of information pieces [3].


2 Answers

Coherence measures the relative distance between words within a topic. There are two major types: c_v, typically 0 < x < 1, and UMass, typically -14 < x < 14. It's rare to see a coherence of 1 or above .9 unless the words being measured are either identical words or bigrams. For example, United and States would likely return a coherence score of ~.94, and hero and hero would return a coherence of 1. The overall coherence score of a topic is the average of the distances between words. I try to attain a .7 in my LDAs if I'm using c_v; I think that is a strong topic correlation. I would say:

  • .3 is bad

  • .4 is low

  • .55 is okay

  • .65 might be as good as it is going to get

  • .7 is nice

  • .8 is unlikely and

  • .9 is probably wrong
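
As a rough illustration, the rule-of-thumb bands above can be written as a small lookup. This is only a sketch: the function name `describe_c_v` is hypothetical, and treating each listed value as the lower bound of its band is an assumption, since the list gives single points rather than ranges.

```python
def describe_c_v(score):
    """Map a c_v coherence score to the rough bands from the list above.

    Assumption: each listed value is the lower bound of its band.
    """
    bands = [
        (0.9, "probably wrong"),
        (0.8, "unlikely"),
        (0.7, "nice"),
        (0.65, "might be as good as it is going to get"),
        (0.55, "okay"),
        (0.4, "low"),
    ]
    # Bands are ordered from highest to lowest threshold.
    for threshold, label in bands:
        if score >= threshold:
            return label
    return "bad"


print(describe_c_v(0.4))
```

These labels are one person's experience with c_v, not a standard scale, so they are best read as a starting point for your own data.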

Low coherence fixes:

  • adjust your parameters: alpha = .1, beta = .01 or .001, random_state = 123, etc.

  • get better data

  • at .4 you probably have the wrong number of topics. Check out https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/ for what is known as the elbow method: it gives you a graph of the optimal number of topics for the greatest coherence in your data set. I'm using Mallet, which has pretty good coherence. Here is code to check coherence for different numbers of topics:

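from pprint import pprint

import gensim
import matplotlib.pyplot as plt
from gensim.models import CoherenceModel

# Assumes mallet_path (location of the MALLET binary), id2word, corpus and
# data_lemmatized are already defined earlier in your script.
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):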
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Increment between successive numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # gensim.models.wrappers (including LdaMallet) is only available in gensim < 4.0
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values
# Can take a long time to run.
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=40, step=6)
# Show graph
limit = 40
start = 2
step = 6
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
# One-element tuple: a bare string would be treated as a sequence of single-character labels
plt.legend(("coherence_values",), loc='best')
plt.show()

# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
    
# Select the model and print the topics.
# Index 3 is the 20-topic model for start=2, step=6; use the index where your own curve peaks.
optimal_model = model_list[3]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
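
Instead of hard-coding the index, the best-scoring run can be looked up from the coherence values. A minimal sketch, assuming you simply want the highest score (the elbow method proper looks for the point where the curve flattens, which may be a smaller number of topics). The helper name `best_num_topics` and the example scores are hypothetical, not output from a real model.

```python
def best_num_topics(topic_counts, coherence_values):
    """Return (index, num_topics, coherence) for the highest coherence score."""
    if not coherence_values or len(topic_counts) != len(coherence_values):
        raise ValueError("topic_counts and coherence_values must be non-empty and the same length")
    # Index of the maximum coherence value (first one wins on ties).
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return best_index, topic_counts[best_index], coherence_values[best_index]


# Illustrative numbers only, not results from a real model.
example_counts = list(range(2, 40, 6))
example_scores = [0.31, 0.42, 0.47, 0.52, 0.50, 0.49, 0.48]
index, num_topics, score = best_num_topics(example_counts, example_scores)
print("Best index:", index, "Num Topics =", num_topics, "Coherence =", score)
```

With the real values you would call `best_num_topics(list(x), coherence_values)` and then use `model_list[index]`.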

I hope this helps :)

Sara answered Oct 21 '22 06:10



In addition to the excellent answer from Sara:

The UMass coherence measures how often two words (Wi, Wj) were seen together in the corpus. For a pair of words it is defined as:

score(Wi, Wj) = log [ (D(Wi, Wj) + EPSILON) / D(Wi) ]

Where: D(Wi, Wj) is the number of documents in which word Wi and word Wj appear together

D(Wi) is the number of documents in the corpus in which word Wi appears

EPSILON is a small value (like 1e-12) added to the numerator to avoid 0 values

If Wi and Wj never appear together, then without EPSILON this results in log(0), which will break the universe. The EPSILON value is kind of a hack to fix this.

In conclusion, you can get anything from a very large negative number all the way up to approximately 0. The interpretation is the same as Sara wrote: the greater the number the better, although a value of exactly 0 would obviously be wrong.
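
The pair score defined above can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular library implements it: the function name `umass_pair_score` and the four example documents are made up, and each document is assumed to be a set of tokens.

```python
import math

EPSILON = 1e-12


def umass_pair_score(docs, w_i, w_j, epsilon=EPSILON):
    """Compute log((D(w_i, w_j) + epsilon) / D(w_i)) from document counts."""
    # D(w_i): number of documents containing w_i
    d_i = sum(1 for doc in docs if w_i in doc)
    # D(w_i, w_j): number of documents containing both words
    d_ij = sum(1 for doc in docs if w_i in doc and w_j in doc)
    if d_i == 0:
        raise ValueError("w_i does not occur in any document")
    return math.log((d_ij + epsilon) / d_i)


# Toy corpus: each document is a set of tokens (made-up data).
docs = [
    {"united", "states", "senate"},
    {"united", "states"},
    {"united", "kingdom"},
    {"hero", "story"},
]

often_together = umass_pair_score(docs, "united", "states")
never_together = umass_pair_score(docs, "united", "hero")
print(often_together, never_together)
```

Words that co-occur in most of the documents containing Wi score close to 0, while words that never co-occur get a very large negative score that is finite only because of EPSILON.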

Muhammad Ali answered Oct 21 '22 05:10
