Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LDA model generates different topics everytime i train on the same corpus

I am using python gensim to train an Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time i repeat the process, it generates different topics.

Why does the same LDA parameters and corpus generate different topics everytime?

And how do i stabilize the topic generation?

I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj) and here's my code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import defaultdict
import codecs, os, glob, math

stopwords = [i.strip() for i in codecs.open('stopmild','r','utf8').readlines() if i[0] != "#" and i != ""]

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]

    # Group topics with similar words together.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)

    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))

    return lda, corpus_lda, top_clusters, top_wordonly

####################################################################### 

# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
             for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)

for i in topic_wordonly:
    print i
like image 784
alvas Avatar asked Feb 25 '13 13:02

alvas


1 Answers

Why does the same LDA parameters and corpus generate different topics everytime?

Because LDA uses randomness in both training and inference steps.

And how do i stabilize the topic generation?

By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, with numpy.random.seed:

SOME_FIXED_SEED = 42

# before training/inference:
np.random.seed(SOME_FIXED_SEED)

(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)

like image 127
Fred Foo Avatar answered Oct 08 '22 16:10

Fred Foo