I'm relatively new to the world of Latent Dirichlet Allocation. I can generate an LDA model following the Wikipedia tutorial, and I can generate an LDA model from my own documents. My next step is to understand how to use a previously generated model to classify unseen documents. I'm saving my "lda_wiki_model" with:
id2word = gensim.corpora.Dictionary.load_from_text('ptwiki_wordids.txt.bz2')
mm = gensim.corpora.MmCorpus('ptwiki_tfidf.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
lda.save('lda_wiki_model.lda')
And I'm loading the same model with:
new_lda = gensim.models.LdaModel.load(path + 'lda_wiki_model.lda')  # load the model
I have a "new_doc.txt". I turn the document into an id <-> term dictionary and convert the tokenized document into a document-term matrix (bag-of-words).
But when I run new_topics = new_lda[corpus], I receive a
<gensim.interfaces.TransformedCorpus object at 0x7f0ecfa69d50>
How can I extract topics from that?
I already tried:
lsa = models.LdaModel(new_topics, id2word=dictionary, num_topics=1, passes=2)
corpus_lda = lsa[new_topics]
print(lsa.print_topics(num_topics=1, num_words=7))
and
print(corpus_lda.print_topics(num_topics=1, num_words=7))
but that returns topics unrelated to my new document. Where is my mistake? Am I misunderstanding something?
If I run a new model using the dictionary and corpus created above, I receive the correct topics. My point is: how do I re-use my model? Is it correct to re-use that wiki_model?
Thank you.
I was facing the same problem. This code will solve your problem:
new_topics = new_lda[corpus]
for topic in new_topics:
    print(topic)
This will give you a list of tuples of the form (topic number, probability) for each document.
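If you only need the dominant topic per document, a small sketch of picking it from those pairs (the pairs below are made-up example output):

```python
# Assumed example output for one document: (topic_id, probability) pairs
# as yielded by iterating new_lda[corpus].
doc_topics = [(0, 0.1), (3, 0.7), (7, 0.2)]

# The dominant topic is the pair with the highest probability.
dominant_topic, prob = max(doc_topics, key=lambda pair: pair[1])
print(dominant_topic, prob)  # → 3 0.7
```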
From the 'Topics_and_Transformation.ipynb' tutorial prepared by the RaRe Technologies people:
Converting the entire corpus at the time of calling
corpus_transformed = model[corpus]
would mean storing the result in main memory, and that contradicts gensim's objective of memory-independence. If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
Hope it helps.
This has been answered, but here is some code for anyone looking to also export the classification of unseen documents to a CSV file.
# Access the unseen corpus
corpus_test = [id2word.doc2bow(doc) for doc in data_test_lemmatized]

# Transform into LDA space based on the old model
lda_unseen = lda_model[corpus_test]

# Print results and collect them for export
topic_probability = []
for t in lda_unseen:
    print(t)
    topic_probability.append(t)

results_test = pd.DataFrame(topic_probability,
                            columns=['Topic 1', 'Topic 2', 'Topic 3',
                                     'Topic 4', 'Topic 5', 'Topic n'])
results_test.to_csv('test_results.csv', index=True, header=True)
Code inspired from this post.
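One caveat with building the DataFrame directly: by default the model omits topics below its minimum_probability, so documents can yield rows of different lengths. A sketch that pads each row to a fixed topic count first (num_topics and the example pairs below are assumed):

```python
num_topics = 6  # assumed; should match the trained model
doc_topics = [(1, 0.6), (4, 0.4)]  # example (topic_id, probability) output

# Fill a fixed-width row, leaving omitted topics at probability 0.0.
row = [0.0] * num_topics
for topic_id, prob in doc_topics:
    row[topic_id] = prob
print(row)  # → [0.0, 0.6, 0.0, 0.0, 0.4, 0.0]
```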