Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Gensim LDA topic assignment

I am hoping to assign each document to one topic using LDA. Now I realise that what you get is a distribution over topics from LDA. However as you see from the last line below I assign it to the most probable topic.

My question is this. I have to run lda[corpus] for somewhat the second time in order to get these topics. Is there some other builtin gensim function that will give me this topic assignment vectors directly? Especially since the LDA algorithm has passed through the documents it might have saved these topic assignments?

    # Get the Dictionary and BoW of the corpus after some stemming/ cleansing
    texts = [[stem(word) for word in document.split() if word not in STOPWORDS] for document in cleanDF.text.values]
    dictionary = corpora.Dictionary(texts)
    dictionary.filter_extremes(no_below=5, no_above=0.9)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # The actual LDA component
    lda = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=30, chunksize=10000, passes=10,workers=4) 

    # Assign each document to most prevalent topic
    lda_topic_assignment = [max(p,key=lambda item: item[1]) for p in lda[corpus]]
like image 426
sachinruk Avatar asked Oct 11 '16 03:10

sachinruk


Video Answer


2 Answers

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]


test =LDA[corpus[0]]
print(test)
sorted(test, reverse=True, key=lambda x: x[1])

Topics = ['Topic_'+str(sorted(LDA[i], reverse=True, key=lambda x: x[1])[0][0]).zfill(3) for i in corpus]
like image 165
DongHwa Kim Avatar answered Oct 26 '22 16:10

DongHwa Kim


There is no other builtin Gensim function that will give the topic assignment vectors directly.

Your question is valid that LDA algorithm has passed through the documents but implementation of LDA is working by updating the model in chunks (based on value of chunksize parameter), hence it will not keep the entire corpus in-memory.

Hence you have to use lda[corpus] or use the method lda.get_document_topics()

like image 22
Venkatachalam Avatar answered Oct 26 '22 16:10

Venkatachalam