
How to monitor convergence of Gensim LDA model?

I can't seem to find how to do this, or perhaps my knowledge of statistics and its terminology is the problem here, but I want to produce something similar to the graph at the bottom of the PyPI page for the lda library and observe the uniformity/convergence of the lines. How can I achieve this with Gensim LDA?

ZeferiniX asked Jun 01 '16 13:06



1 Answer

You are right to wish to plot the convergence of your model fitting. Unfortunately, Gensim does not seem to make this very straightforward.

  1. Run the model in such a way that you will be able to analyze the output of the model fitting function. Gensim reports its training progress through Python's logging module, so I like to set up a log file before training.

    import logging
    logging.basicConfig(filename='gensim.log',
                        format="%(asctime)s:%(levelname)s:%(message)s",
                        level=logging.INFO)
    
  2. Set the eval_every parameter in LdaModel. Gensim logs a perplexity estimate once every eval_every model updates, so the lower this value is, the better the resolution of your plot will be. However, computing the perplexity can slow down your fit a lot!

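    from gensim.models import LdaModel

    # corpus is your bag-of-words corpus and id2word its gensim Dictionary
    lda_model = LdaModel(corpus=corpus,
                         id2word=id2word,
                         num_topics=30,
                         eval_every=10,  # log perplexity every 10 updates
                         passes=40,      # the keyword is "passes"; "pass" is a syntax error
                         iterations=5000)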
    
  3. Parse the log file and make your plot.

    import re
    import matplotlib.pyplot as plt

    # Capture the per-word bound (log likelihood) and the perplexity from each log line
    p = re.compile(r"(-?\d+\.\d+) per-word .* (\d+\.\d+) perplexity")
    with open('gensim.log') as log_file:
        matches = [p.findall(line) for line in log_file]
    matches = [m for m in matches if len(m) > 0]
    tuples = [t[0] for t in matches]
    perplexity = [float(t[1]) for t in tuples]
    likelihood = [float(t[0]) for t in tuples]
    # One point per logged evaluation; with eval_every=10 they are 10 updates apart
    iterations = list(range(0, len(tuples) * 10, 10))
    plt.plot(iterations, likelihood, c="black")
    plt.ylabel("log likelihood")
    plt.xlabel("iteration")
    plt.title("Topic Model Convergence")
    plt.grid()
    plt.savefig("convergence_likelihood.pdf")
    plt.close()
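
If you want to try the parsing in step 3 without a real log file, here is a minimal sketch that wraps the same regular expression in a function and also reports the change between consecutive evaluations, which should shrink toward zero as the curve flattens. The helper names parse_convergence and deltas are hypothetical, and the sample log lines are invented for illustration; they assume the message shape that the regular expression in step 3 expects, combined with the log format from step 1.

```python
import re

# Same pattern as in step 3: captures the per-word bound and the perplexity.
BOUND_RE = re.compile(r"(-?\d+\.\d+) per-word .* (\d+\.\d+) perplexity")


def parse_convergence(lines):
    """Return (likelihoods, perplexities) found in an iterable of log lines."""
    likelihoods, perplexities = [], []
    for line in lines:
        match = BOUND_RE.search(line)
        if match:  # lines without a perplexity estimate are skipped
            likelihoods.append(float(match.group(1)))
            perplexities.append(float(match.group(2)))
    return likelihoods, perplexities


def deltas(values):
    """Change between consecutive evaluations; values near zero suggest a plateau."""
    return [b - a for a, b in zip(values, values[1:])]


# Invented sample lines (assumption: not taken from a real training run).
sample_log = [
    "2016-06-01 13:06:00,000:INFO:-8.000 per-word bound, 256.0 perplexity "
    "estimate based on a held-out corpus of 100 documents with 5000 words",
    "2016-06-01 13:06:05,000:INFO:some unrelated progress message",
    "2016-06-01 13:06:10,000:INFO:-7.500 per-word bound, 181.0 perplexity "
    "estimate based on a held-out corpus of 100 documents with 5000 words",
    "2016-06-01 13:06:20,000:INFO:-7.250 per-word bound, 152.2 perplexity "
    "estimate based on a held-out corpus of 100 documents with 5000 words",
]

likelihoods, perplexities = parse_convergence(sample_log)
print("log likelihood:", likelihoods)
print("perplexity:", perplexities)
print("change per evaluation:", deltas(likelihoods))
```

With a real run, pass the open log file instead of sample_log, for example parse_convergence(open('gensim.log')), and plot the returned lists as in step 3.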
    
groceryheist answered Sep 22 '22 19:09