I was able to run gensim's LDA code and got the top 10 topics with their respective keywords.
Now I would like to go a step further and check how accurate the LDA algorithm is by seeing which documents it clusters into each topic. Is this possible with gensim's LDA?
Basically, I would like to do something like this, but in Python and using gensim:
LDA with topicmodels, how can I see which topics different documents belong to?
The results of an LDA give probability distributions for the topics over the vocabulary. In practice this means a list of words from the vocabulary, each with a probability associated with it. We can of course list the words in order of decreasing probability, and look at the top j words per topic for some j.
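To make "top j words per topic" concrete, here is a minimal sketch in plain Python. The `topic` dictionary and its probabilities are invented for illustration, and `top_words` is a hypothetical helper, not part of gensim.

```python
# Hypothetical topic: a word -> probability mapping (values invented for illustration).
topic = {"trees": 0.30, "graph": 0.25, "minors": 0.20, "survey": 0.15, "user": 0.10}


def top_words(topic_word_probs, j):
    """Return the j most probable (word, probability) pairs, highest first."""
    ranked = sorted(topic_word_probs.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:j]


top3 = top_words(topic, 3)
print(top3)
```

If j is larger than the vocabulary, the slice simply returns every word in ranked order.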
LDA is applied to the text data. It can be viewed as decomposing the corpus document-word matrix (the larger matrix) into two smaller matrices: the document-topic matrix and the topic-word matrix. In that sense LDA, like PCA, can be thought of as a matrix factorization technique.
Using the topic probabilities, you can set a threshold and use it as a clustering baseline, but I am sure there are better ways to cluster than this 'hacky' method.
    from gensim import corpora, models, similarities
    from itertools import chain

    """ DEMO """

    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

    # remove common words and tokenize
    stoplist = set('for a of the and to in'.split())
    texts = [[word for word in document.lower().split() if word not in stoplist]
             for document in documents]

    # remove words that appear only once
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
    texts = [[word for word in text if word not in tokens_once] for text in texts]

    # Create Dictionary.
    id2word = corpora.Dictionary(texts)
    # Creates the Bag of Word corpus.
    mm = [id2word.doc2bow(text) for text in texts]

    # Trains the LDA models.
    lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=3,
                                   update_every=1, chunksize=10000, passes=1)

    # Prints the topics.
    for top in lda.print_topics():
        print(top)
    print()

    # Assigns the topics to the documents in corpus
    lda_corpus = lda[mm]

    # Find the threshold, let's set the threshold to be 1/#clusters,
    # To prove that the threshold is sane, we average the sum of all probabilities:
    scores = list(chain(*[[score for topic_id, score in topic]
                          for topic in [doc for doc in lda_corpus]]))
    threshold = sum(scores) / len(scores)
    print(threshold)
    print()

    # Note: i[k] is the k-th listed (topic_id, probability) pair of a document
    cluster1 = [j for i, j in zip(lda_corpus, documents) if i[0][1] > threshold]
    cluster2 = [j for i, j in zip(lda_corpus, documents) if i[1][1] > threshold]
    cluster3 = [j for i, j in zip(lda_corpus, documents) if i[2][1] > threshold]

    print(cluster1)
    print(cluster2)
    print(cluster3)
[out]:

    0.131*trees + 0.121*graph + 0.119*system + 0.115*user + 0.098*survey + 0.082*interface + 0.080*eps + 0.064*minors + 0.056*response + 0.056*computer
    0.171*time + 0.171*user + 0.170*response + 0.082*survey + 0.080*computer + 0.079*system + 0.050*trees + 0.042*graph + 0.040*minors + 0.040*human
    0.155*system + 0.150*human + 0.110*graph + 0.107*minors + 0.094*trees + 0.090*eps + 0.088*computer + 0.087*interface + 0.040*survey + 0.028*user

    0.333333333333

    ['The EPS user interface management system', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors A survey']
    ['A survey of user opinion of computer system response time', 'Relation of user perceived response time to error measurement']
    ['Human machine interface for lab abc computer applications', 'System and human system engineering testing of EPS', 'Graph minors IV Widths of trees and well quasi ordering']
Just to make it clearer:
    # Find the threshold, let's set the threshold to be 1/#clusters,
    # To prove that the threshold is sane, we average the sum of all probabilities:
    scores = []
    for doc in lda_corpus:
        for topic_id, score in doc:
            scores.append(score)
    threshold = sum(scores) / len(scores)

The above code sums the topic probabilities of every document in the corpus, then normalizes the sum by the number of scores. Each document's topic probabilities sum to roughly 1, so when every topic is listed for every document the average comes out at about 1/#topics.
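A quick way to convince yourself that this average lands on 1/#topics is to run it on a small hand-written stand-in for `lda[mm]`. The numbers below are made up (each row sums to 1 and lists all 3 topics); this is only a sketch of the arithmetic, not output from a trained model.

```python
# Hand-written stand-in for lda[mm]: each document lists (topic_id, probability)
# for all 3 topics. Values are invented; every row sums to 1.
doc_topics = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.1), (1, 0.8), (2, 0.1)],
    [(0, 0.25), (1, 0.25), (2, 0.5)],
]
num_topics = 3

# Flatten all probabilities and average them.
scores = [score for doc in doc_topics for _, score in doc]
threshold = sum(scores) / len(scores)

print(threshold)         # approximately 1/3
print(1.0 / num_topics)
```

The total is one per document and the count is #topics per document, so the ratio is 1/#topics. This only holds when no topic has been dropped from a document's list.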
If you want to use the trick of

    cluster1 = [j for i, j in zip(lda_corpus, documents) if i[0][1] > threshold]
    cluster2 = [j for i, j in zip(lda_corpus, documents) if i[1][1] > threshold]
    cluster3 = [j for i, j in zip(lda_corpus, documents) if i[2][1] > threshold]

in the previous answer by alvas, make sure to set minimum_probability=0 in LdaModel:
gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word = dictionary, passes=2, minimum_probability=0)
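Otherwise the per-document topic lists in lda_corpus may be shorter than the number of topics, since gensim omits any topic whose probability is lower than minimum_probability. In that case i[1] is no longer guaranteed to be topic 1, and the index may not exist at all.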
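If you would rather not depend on every topic being listed, one option is to look probabilities up by topic id instead of by position. The sketch below uses hand-written `(topic_id, probability)` lists in the shape `lda[mm]` yields; `cluster_by_threshold` is a hypothetical helper and the data is invented.

```python
def cluster_by_threshold(doc_topics, documents, num_topics, threshold):
    """Put each document into every topic whose probability exceeds threshold."""
    clusters = [[] for _ in range(num_topics)]
    for topics, text in zip(doc_topics, documents):
        probs = dict(topics)  # topic_id -> probability; omitted topics are absent
        for topic_id in range(num_topics):
            if probs.get(topic_id, 0.0) > threshold:
                clusters[topic_id].append(text)
    return clusters


# Invented sample data: some topics are missing, as happens when
# low-probability topics are filtered out.
documents = ["doc a", "doc b", "doc c"]
doc_topics = [
    [(0, 0.90), (2, 0.10)],             # topic 1 omitted
    [(1, 0.98)],                        # topics 0 and 2 omitted
    [(0, 0.30), (1, 0.30), (2, 0.40)],  # all topics present
]

clusters = cluster_by_threshold(doc_topics, documents, 3, 1.0 / 3)
print(clusters)
```

With positional indexing, `i[1][1]` on "doc a" would read topic 2's probability, and on "doc b" it would raise an IndexError.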
An alternative way to group documents into topics is to assign each document to the topic with the maximum probability:

    # topic_num is the number of topics the model was trained with
    lda_corpus = [max(prob, key=lambda y: y[1]) for prob in lda[mm]]
    playlists = [[] for i in range(topic_num)]
    for i, x in enumerate(lda_corpus):
        playlists[x[0]].append(documents[i])
Note that lda[mm] is, roughly speaking, a list of lists, or a 2D matrix. The number of rows is the number of documents and the number of columns is the number of topics. Each matrix element is a tuple of the form (3, 0.82), for example. Here 3 refers to the topic index and 0.82 is the corresponding probability of belonging to that topic. By default, minimum_probability=0.01 and any tuple with probability less than 0.01 is omitted in lda[mm]. You can set it to be 1/#topics if you use the grouping method with maximum probability.
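Here is a self-contained sketch of the maximum-probability grouping on hand-written data, so it can be tried without training a model. `group_by_top_topic` is a hypothetical helper, the sample lists are invented, and skipping empty lists is a defensive assumption rather than something gensim requires.

```python
def group_by_top_topic(doc_topics, documents, num_topics):
    """Assign each document to its single most probable topic."""
    groups = [[] for _ in range(num_topics)]
    for topics, text in zip(doc_topics, documents):
        if not topics:
            # Defensive: nothing survived the probability filter for this document.
            continue
        best_topic, _ = max(topics, key=lambda pair: pair[1])
        groups[best_topic].append(text)
    return groups


# Invented sample data in the (topic_id, probability) shape described above.
documents = ["doc a", "doc b", "doc c", "doc d"]
doc_topics = [
    [(0, 0.82), (1, 0.18)],
    [(2, 0.95)],
    [(0, 0.40), (1, 0.45), (2, 0.15)],
    [(1, 0.60), (2, 0.40)],
]

groups = group_by_top_topic(doc_topics, documents, 3)
print(groups)
```

Because only the largest entry of each row is used, it does not matter whether smaller entries were filtered out, which is why a higher minimum_probability is harmless for this method.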