
Topic distribution: How do we see which documents belong to which topic after doing LDA in Python

I was able to run the LDA code from gensim and got the top 10 topics with their respective keywords.

Now I would like to go a step further and see how accurate the LDA algorithm is by checking which documents it clusters into each topic. Is this possible in gensim LDA?

Basically, I would like to do something like this, but in Python and using gensim:

LDA with topicmodels, how can I see which topics different documents belong to?

jxn asked Jan 08 '14 00:01




2 Answers

Using the probabilities of the topics, you can try to set some threshold and use it as a clustering baseline, but I am sure there are better ways to do clustering than this 'hacky' method.

    from gensim import corpora, models
    from itertools import chain

    # Demo corpus.
    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

    # Remove common words and tokenize.
    stoplist = set('for a of the and to in'.split())
    texts = [[word for word in document.lower().split() if word not in stoplist]
             for document in documents]

    # Remove words that appear only once.
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
    texts = [[word for word in text if word not in tokens_once] for text in texts]

    # Create the dictionary.
    id2word = corpora.Dictionary(texts)
    # Create the bag-of-words corpus.
    mm = [id2word.doc2bow(text) for text in texts]

    # Train the LDA model.
    lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=3,
                                   update_every=1, chunksize=10000, passes=1)

    # Print the topics.
    for top in lda.print_topics():
        print(top)
    print()

    # Assign the topics to the documents in the corpus.
    lda_corpus = lda[mm]

    # Find the threshold; let's set the threshold to be 1/#clusters.
    # To check that the threshold is sane, we average all the probabilities:
    scores = list(chain(*[[score for topic_id, score in topic]
                          for topic in [doc for doc in lda_corpus]]))
    threshold = sum(scores) / len(scores)
    print(threshold)
    print()

    # Each document joins every cluster whose topic probability exceeds the threshold.
    cluster1 = [j for i, j in zip(lda_corpus, documents) if i[0][1] > threshold]
    cluster2 = [j for i, j in zip(lda_corpus, documents) if i[1][1] > threshold]
    cluster3 = [j for i, j in zip(lda_corpus, documents) if i[2][1] > threshold]

    print(cluster1)
    print(cluster2)
    print(cluster3)

[out]:

    0.131*trees + 0.121*graph + 0.119*system + 0.115*user + 0.098*survey + 0.082*interface + 0.080*eps + 0.064*minors + 0.056*response + 0.056*computer
    0.171*time + 0.171*user + 0.170*response + 0.082*survey + 0.080*computer + 0.079*system + 0.050*trees + 0.042*graph + 0.040*minors + 0.040*human
    0.155*system + 0.150*human + 0.110*graph + 0.107*minors + 0.094*trees + 0.090*eps + 0.088*computer + 0.087*interface + 0.040*survey + 0.028*user

    0.333333333333

    ['The EPS user interface management system', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors A survey']
    ['A survey of user opinion of computer system response time', 'Relation of user perceived response time to error measurement']
    ['Human machine interface for lab abc computer applications', 'System and human system engineering testing of EPS', 'Graph minors IV Widths of trees and well quasi ordering']

Just to make it clearer:

    # Find the threshold; let's set the threshold to be 1/#clusters.
    # To check that the threshold is sane, we average all the probabilities:
    scores = []
    for doc in lda_corpus:
        # Each doc is a list of (topic_id, score) tuples.
        for topic_id, score in doc:
            scores.append(score)
    threshold = sum(scores) / len(scores)

The code above sums the topic probabilities over all topics and all documents, then divides that sum by the number of scores to get their average.
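As a quick check of why that average lands on 1/#clusters, here is a minimal sketch on hand-written topic distributions. The `doc_topics` values are made up for illustration and are not gensim output, and the sketch assumes every document lists all three topics. Since each document's probabilities sum to 1, the mean of all scores should come out at 1/3.

```python
# Hypothetical per-document topic distributions in (topic_id, probability)
# form; the numbers are invented for illustration, not produced by gensim.
doc_topics = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.1), (1, 0.8), (2, 0.1)],
    [(0, 0.25), (1, 0.25), (2, 0.5)],
]

# Flatten every probability into one list and take the mean.
scores = [score for doc in doc_topics for _topic_id, score in doc]
threshold = sum(scores) / len(scores)

# Each document's probabilities sum to 1, so the mean is 1 / num_topics.
num_topics = 3
print(round(threshold, 6), round(1 / num_topics, 6))
```

If some topics are omitted from a document's list, fewer scores are averaged and the result will drift away from 1/#clusters.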

alvas answered Oct 02 '22 21:10



If you want to use the trick of

    cluster1 = [j for i, j in zip(lda_corpus, documents) if i[0][1] > threshold]
    cluster2 = [j for i, j in zip(lda_corpus, documents) if i[1][1] > threshold]
    cluster3 = [j for i, j in zip(lda_corpus, documents) if i[2][1] > threshold]

in the previous answer by alvas, make sure to set minimum_probability=0 in LdaModel:

    gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
                                    passes=2, minimum_probability=0)

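Otherwise the per-document topic lists in lda_corpus may be shorter than the number of topics, since gensim omits any topic whose probability is lower than minimum_probability. Positional indexing such as i[2][1] can then fail or pick up a different topic.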
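If retraining with minimum_probability=0 is not convenient, one possible workaround is to look topics up by id rather than by position. The sketch below is plain Python on made-up data; `cluster_by_threshold` is a hypothetical helper, not part of gensim, and it assumes an omitted topic can be treated as probability 0.

```python
def cluster_by_threshold(doc_topics, documents, num_topics, threshold):
    """Group documents under every topic whose probability exceeds threshold.

    doc_topics holds one list of (topic_id, probability) tuples per document;
    topics missing from a list are treated as probability 0.
    """
    clusters = [[] for _ in range(num_topics)]
    for topics, document in zip(doc_topics, documents):
        probs = dict(topics)  # topic_id -> probability
        for topic_id in range(num_topics):
            if probs.get(topic_id, 0.0) > threshold:
                clusters[topic_id].append(document)
    return clusters


# Made-up example: the second and third documents have topics omitted.
documents = ["doc a", "doc b", "doc c"]
doc_topics = [
    [(0, 0.6), (1, 0.3), (2, 0.1)],
    [(1, 0.99)],
    [(0, 0.4), (2, 0.6)],
]
clusters = cluster_by_threshold(doc_topics, documents, num_topics=3, threshold=1 / 3)
print(clusters)
```

As with the original trick, a document can land in more than one cluster, or in none, depending on the threshold.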

An alternative way to group documents into topics is to assign each document to the topic with the maximum probability:

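    # For each document, keep only the (topic_id, probability) pair with the highest probability.
    lda_corpus = [max(prob, key=lambda y: y[1]) for prob in lda[mm]]
    # One bucket per topic; topic_num is the number of topics in the model.
    playlists = [[] for i in range(topic_num)]
    for i, x in enumerate(lda_corpus):
        playlists[x[0]].append(documents[i])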

Note that lda[mm] is, roughly speaking, a list of lists, or a 2D matrix: the number of rows is the number of documents and the number of columns is the number of topics. Each element is a tuple such as (3, 0.82), where 3 is the topic index and 0.82 is the probability that the document belongs to that topic. By default, minimum_probability=0.01, and any tuple with a probability below 0.01 is omitted from lda[mm]. If you group by maximum probability, you can set it to 1/#topics.
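To make that layout concrete, a sparse row can be expanded into a fixed-width row with one column per topic. This is only a sketch on invented data; `to_dense` is a hypothetical helper rather than a gensim function, and the probabilities are not real model output.

```python
def to_dense(topics, num_topics):
    """Expand [(topic_id, probability), ...] into a list with one slot per topic."""
    row = [0.0] * num_topics
    for topic_id, prob in topics:
        row[topic_id] = prob
    return row


# Invented rows in the (topic_id, probability) shape; the second row has its
# low-probability topics omitted.
sparse_rows = [
    [(0, 0.15), (1, 0.03), (3, 0.82)],
    [(2, 0.99)],
]
matrix = [to_dense(row, num_topics=4) for row in sparse_rows]

# Rows are documents, columns are topics.
for row in matrix:
    print(row)

# Column index of the highest probability in each row.
best_topics = [max(range(len(row)), key=row.__getitem__) for row in matrix]
print(best_topics)
```

gensim's `matutils.sparse2full` performs a similar expansion into a NumPy array, which may be preferable for larger corpora.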

nos answered Oct 02 '22 22:10
