Topic distribution: How do we see which documents belong to which topic after running LDA in Python?

2 Answers

Using the topic probabilities, you can set a threshold and use it as a clustering baseline, but I am sure there are better ways to cluster than this 'hacky' method.

    from gensim import corpora, models
    from itertools import chain

    """ DEMO """
    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

    # remove common words and tokenize
    stoplist = set('for a of the and to in'.split())
    texts = [[word for word in document.lower().split() if word not in stoplist]
             for document in documents]

    # remove words that appear only once
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
    texts = [[word for word in text if word not in tokens_once] for text in texts]

    # Create Dictionary.
    id2word = corpora.Dictionary(texts)
    # Creates the Bag of Word corpus.
    mm = [id2word.doc2bow(text) for text in texts]

    # Trains the LDA models.
    lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=3,
                                   update_every=1, chunksize=10000, passes=1)

    # Prints the topics.
    for top in lda.print_topics():
        print(top)
    print()

    # Assigns the topics to the documents in corpus
    lda_corpus = lda[mm]

    # Find the threshold, let's set the threshold to be 1/#clusters,
    # To prove that the threshold is sane, we average the sum of all probabilities:
    scores = list(chain(*[[score for topic_id, score in topic]
                          for topic in [doc for doc in lda_corpus]]))
    threshold = sum(scores)/len(scores)
    print(threshold)
    print()

    # Each test below reads the k-th entry of a document's topic list by position
    cluster1 = [j for i, j in zip(lda_corpus, documents) if i[0][1] > threshold]
    cluster2 = [j for i, j in zip(lda_corpus, documents) if i[1][1] > threshold]
    cluster3 = [j for i, j in zip(lda_corpus, documents) if i[2][1] > threshold]

    print(cluster1)
    print(cluster2)
    print(cluster3)

[out]:

    0.131*trees + 0.121*graph + 0.119*system + 0.115*user + 0.098*survey + 0.082*interface + 0.080*eps + 0.064*minors + 0.056*response + 0.056*computer
    0.171*time + 0.171*user + 0.170*response + 0.082*survey + 0.080*computer + 0.079*system + 0.050*trees + 0.042*graph + 0.040*minors + 0.040*human
    0.155*system + 0.150*human + 0.110*graph + 0.107*minors + 0.094*trees + 0.090*eps + 0.088*computer + 0.087*interface + 0.040*survey + 0.028*user

    0.333333333333

    ['The EPS user interface management system', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors A survey']
    ['A survey of user opinion of computer system response time', 'Relation of user perceived response time to error measurement']
    ['Human machine interface for lab abc computer applications', 'System and human system engineering testing of EPS', 'Graph minors IV Widths of trees and well quasi ordering']

Just to make it clearer:

    # Find the threshold, let's set the threshold to be 1/#clusters,
    # To prove that the threshold is sane, we average the sum of all probabilities:
    scores = []
    for doc in lda_corpus:
        # each doc is a list of (topic_id, score) tuples
        for topic_id, score in doc:
            scores.append(score)
    threshold = sum(scores)/len(scores)

The code above sums the topic probabilities over all documents and then divides that sum by the number of scores, i.e. it takes their average.
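As a quick sanity check of that averaging step, here is a minimal sketch using hand-made topic distributions in the same (topic_id, probability) format that lda[mm] yields. The values in toy_lda_corpus are invented for illustration and are not output from the model above.

```python
# Hypothetical per-document topic distributions, shaped like the items of lda[mm].
# Each inner list covers all 3 topics and its probabilities sum to 1.
toy_lda_corpus = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.1), (1, 0.8), (2, 0.1)],
    [(0, 0.25), (1, 0.25), (2, 0.5)],
]
num_topics = 3

# Flatten every probability into one list and average it.
scores = [score for doc in toy_lda_corpus for _topic_id, score in doc]
threshold = sum(scores) / len(scores)

print(threshold)
```

Because each document's probabilities sum to 1, the average comes out at 1/num_topics as long as no topic is dropped from any document's list.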

answered Oct 02 '22 21:10

alvas

If you want to use the trick of

    cluster1 = [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]
    cluster2 = [j for i,j in zip(lda_corpus,documents) if i[1][1] > threshold]
    cluster3 = [j for i,j in zip(lda_corpus,documents) if i[2][1] > threshold]

from the previous answer by alvas, make sure to set minimum_probability=0 in LdaModel:

    gensim.models.ldamodel.LdaModel(corpus,
                                    num_topics=num_topics, id2word=dictionary,
                                    passes=2, minimum_probability=0)

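Otherwise a document's topic list may contain fewer entries than there are topics, since gensim omits any topic whose probability is lower than minimum_probability. In that case i[k] no longer necessarily refers to topic k, and indexing a missing position raises an IndexError.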
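To illustrate the problem, the sketch below uses invented topic lists in which low-probability topics have been dropped, as can happen with a non-zero minimum_probability. The helper topic_prob is a hypothetical name, not part of gensim; it looks a topic up by id rather than by position, which is one possible way to keep the threshold trick working on sparse output.

```python
# Hypothetical sparse output: topics below the cutoff are simply missing.
documents = ["doc a", "doc b", "doc c"]
sparse_lda_corpus = [
    [(0, 0.98)],             # topics 1 and 2 were dropped
    [(1, 0.60), (2, 0.39)],  # topic 0 was dropped
    [(2, 0.99)],             # topics 0 and 1 were dropped
]
num_topics = 3
threshold = 1.0 / num_topics


def topic_prob(doc_topics, topic_id):
    """Return the probability of topic_id, or 0.0 if it was omitted."""
    return dict(doc_topics).get(topic_id, 0.0)


# Positional indexing would mislead here: sparse_lda_corpus[1][0] is topic 1,
# not topic 0, and sparse_lda_corpus[0][2] does not exist at all.
clusters = [
    [text for topics, text in zip(sparse_lda_corpus, documents)
     if topic_prob(topics, k) > threshold]
    for k in range(num_topics)
]
print(clusters)
```

A document can land in more than one cluster this way, which matches the behaviour of the original threshold approach.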

An alternative way to group documents into topics is to assign each document to its topic with the maximum probability:

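    # keep only the (topic_id, probability) pair with the highest probability
    lda_corpus = [max(prob, key=lambda y: y[1])
                  for prob in lda[mm]]
    # topic_num is the number of topics the model was trained with
    playlists = [[] for i in range(topic_num)]
    for i, x in enumerate(lda_corpus):
        playlists[x[0]].append(documents[i])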

Note that lda[mm] is, roughly speaking, a list of lists, or a 2D matrix: the number of rows is the number of documents and the number of columns is the number of topics. Each element is a tuple such as (3, 0.82), where 3 is the topic index and 0.82 is the probability that the document belongs to that topic. By default, minimum_probability=0.01, and any tuple with a probability below 0.01 is omitted from lda[mm]. If you use the maximum-probability grouping method, you can set it to 1/#topics.
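The matrix view can be made explicit. The following is a rough sketch that expands sparse (topic_id, probability) lists into one dense row per document and then picks the most probable topic in each row. The input values and the helper name to_dense are made up for illustration and are not part of gensim.

```python
def to_dense(doc_topics, num_topics):
    """Expand [(topic_id, prob), ...] into a list of length num_topics."""
    row = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        row[topic_id] = prob
    return row


# Hypothetical lda[mm]-style output for 3 documents and 4 topics.
toy_lda_corpus = [
    [(3, 0.82), (0, 0.18)],
    [(1, 0.55), (2, 0.45)],
    [(0, 0.97)],
]
num_topics = 4

# One row per document, one column per topic; omitted topics become 0.0.
matrix = [to_dense(doc, num_topics) for doc in toy_lda_corpus]

# The index of the largest entry in each row is the assigned topic.
assigned = [row.index(max(row)) for row in matrix]
print(assigned)
```

With dense rows, column k always means topic k, so the same table can serve both the threshold method and the maximum-probability method.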

answered Oct 02 '22 22:10

nos


Tags:

python

nltk

gensim

lda

jxn
