I can't seem to find it, or perhaps my knowledge of statistics and its terminology is the problem, but I want to produce something similar to the graph at the bottom of the PyPI page for the lda library, so that I can observe the uniformity/convergence of the lines. How can I achieve this with Gensim LDA?
chunksize - number of documents to consider at once (affects the memory consumption)
update_every - update the model every update_every chunksize chunks (essentially, this is for memory consumption optimization)
passes - how many times the algorithm is supposed to pass over the whole corpus
Passes is the number of times you want to go through the entire corpus. Below are a few examples of different combinations of the 3 parameters and the number of online training updates which will occur while training LDA.
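As a rough illustration of how the three parameters interact, the sketch below counts model updates for a few combinations. It assumes a single worker, that each pass splits the corpus into ceil(num_docs / chunksize) chunks, that the model is updated once per group of update_every chunks (with a trailing partial group still triggering an update), and that update_every=0 means one batch update per pass. The helper name count_updates and the corpus size are made up for the example; this is not part of the Gensim API.

```python
import math

def count_updates(num_docs, chunksize, update_every, passes):
    """Rough count of model updates (hypothetical helper, not part of Gensim)."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    if update_every == 0:
        # Assumed batch mode: a single update after each full pass.
        updates_per_pass = 1
    else:
        # One update per group of `update_every` chunks;
        # a leftover partial group at the end of the pass still counts.
        updates_per_pass = math.ceil(chunks_per_pass / update_every)
    return updates_per_pass * passes

NUM_DOCS = 10_000  # made-up corpus size
# Each tuple is (chunksize, update_every, passes).
for chunksize, update_every, passes in [(2000, 1, 1), (2000, 2, 1), (2000, 0, 3), (10_000, 1, 2)]:
    total = count_updates(NUM_DOCS, chunksize, update_every, passes)
    print(f"chunksize={chunksize}, update_every={update_every}, passes={passes}: {total} updates")
```

Smaller chunks and a lower update_every give more frequent updates, while more passes multiply the count.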
How do you find the optimum number of topics? One approach is to build many LDA models with different numbers of topics and pick the one that gives the highest coherence value. If you see the same keywords repeated in multiple topics, it's probably a sign that 'k' is too large.
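The repeated-keyword check can be done by eye, but a small helper makes it easier when there are many topics. The sketch below is plain Python and assumes you already have the top keywords of each topic as lists of strings (with Gensim these could be collected from the output of show_topics, for example). The function name repeated_keywords and the sample topics are invented for illustration.

```python
from collections import defaultdict

def repeated_keywords(topics):
    """Map each keyword found in more than one topic to the ids of those topics."""
    seen = defaultdict(set)
    for topic_id, words in topics.items():
        for word in words:
            seen[word].add(topic_id)
    return {word: sorted(ids) for word, ids in seen.items() if len(ids) > 1}

# Made-up top keywords for three topics, for illustration only.
topics = {
    0: ["price", "battery", "screen", "phone"],
    1: ["battery", "charge", "phone", "cable"],
    2: ["delivery", "box", "late", "refund"],
}
overlap = repeated_keywords(topics)
print(overlap)
```

If many keywords show up in this mapping, that may be a hint to try a smaller number of topics and compare coherence values again.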
To use MALLET LDA, we'll need to fit and transform the data using the vectorizer and create some variables that the model needs. First, we fit the CountVectorizer to our negative reviews. Then, we create a document-word matrix and convert it from a sparse matrix into a gensim word corpus.
You are right to wish to plot the convergence of your model fitting. Gensim unfortunately does not seem to make this very straightforward.
Run the model in such a way that you will be able to analyze the output of the model fitting function. I like to set up a log file.
import logging
logging.basicConfig(filename='gensim.log',
format="%(asctime)s:%(levelname)s:%(message)s",
level=logging.INFO)
Set the eval_every parameter in LdaModel. The lower this value is, the better resolution your plot will have. However, computing the perplexity can slow down your fit a lot!
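from gensim.models import LdaModel

# corpus and id2word are assumed to be an existing bag-of-words corpus and its dictionary
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=30,
                     eval_every=10,
                     passes=40,  # the keyword is passes; pass is a reserved word in Python
                     iterations=5000)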
Parse the log file and make your plot.
import re
import matplotlib.pyplot as plt

# Each evaluation line in the log carries a per-word bound and a perplexity estimate
p = re.compile(r"(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity")
with open('gensim.log') as log_file:
    matches = [p.findall(line) for line in log_file]
matches = [m for m in matches if len(m) > 0]
tuples = [t[0] for t in matches]
perplexity = [float(t[1]) for t in tuples]
likelihood = [float(t[0]) for t in tuples]
# A step of 10 matches the eval_every=10 used when fitting the model
iterations = list(range(0, len(tuples) * 10, 10))
plt.plot(iterations, likelihood, c="black")
plt.ylabel("log likelihood")
plt.xlabel("iteration")
plt.title("Topic Model Convergence")
plt.grid()
plt.savefig("convergence_likelihood.pdf")
plt.close()
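Before running the parser on a long log, it can help to check the pattern on a handful of lines. The sketch below wraps the same regular expression in a small function and applies it to hypothetical lines; these lines are shaped only to fit the pattern and the log format configured above, the numbers are made up, and the exact wording of real Gensim log messages may differ between versions. The name parse_line is an assumption introduced for this example.

```python
import re

# Same pattern as in the answer, written as a raw string.
PATTERN = re.compile(r"(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity")

def parse_line(line):
    """Return (per_word_bound, perplexity) as floats, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Hypothetical log lines with made-up numbers, for illustration only.
sample_lines = [
    "2024-01-01 12:00:00,000:INFO:-9.250 per-word bound, 608.9 perplexity estimate",
    "2024-01-01 12:00:05,000:INFO:some unrelated progress message",
    "2024-01-01 12:00:10,000:INFO:-8.500 per-word bound, 362.0 perplexity estimate",
]
parsed = [result for result in map(parse_line, sample_lines) if result is not None]
print(parsed)
```

If the list comes back empty on your own log, the message format probably differs from what the pattern expects and the expression needs adjusting.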