I'm relatively new to the world of Latent Dirichlet Allocation. I can generate an LDA model following the Wikipedia tutorial, and I can generate an LDA model from my own documents. My next step is to understand how to use a previously generated model to classify unseen documents. I'm saving my "lda_wiki_model" with:
id2word = gensim.corpora.Dictionary.load_from_text('ptwiki_wordids.txt.bz2')
mm = gensim.corpora.MmCorpus('ptwiki_tfidf.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
lda.save('lda_wiki_model.lda')
And I'm loading the same model with:
new_lda = gensim.models.LdaModel.load(path + 'lda_wiki_model.lda')  # load the model
I have a "new_doc.txt". I turn the document into an id <-> term dictionary and convert the tokenized document into a document-term matrix (bag-of-words).
But when I run new_topics = new_lda[corpus], I receive a
<gensim.interfaces.TransformedCorpus object at 0x7f0ecfa69d50>
How can I extract topics from that?
I already tried:
lsa = models.LdaModel(new_topics, id2word=dictionary, num_topics=1, passes=2)
corpus_lda = lsa[new_topics]
print(lsa.print_topics(num_topics=1, num_words=7))
and
print(corpus_lda.print_topics(num_topics=1, num_words=7))
but that returns topics unrelated to my new document. Where is my mistake? Am I misunderstanding something?
If I run a new model using the dictionary and corpus created above, I receive the correct topics. My point is: how do I re-use my model? Is it correct to re-use that wiki_model?
Thank you.
I was facing the same problem. This code will solve your problem:
new_topics = new_lda[corpus]
for topic in new_topics:
    print(topic)
This will give you a list of tuples of the form (topic number, probability) for each document.
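If you only need the dominant topic per document, a small sketch of picking it from those pairs (the pairs below are made-up example output):

```python
# Assumed example output for one document: (topic_id, probability) pairs
# as yielded by iterating new_lda[corpus].
doc_topics = [(0, 0.1), (3, 0.7), (7, 0.2)]

# The dominant topic is the pair with the highest probability.
dominant_topic, prob = max(doc_topics, key=lambda pair: pair[1])
print(dominant_topic, prob)  # → 3 0.7
```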
From the 'Topics_and_Transformation.ipynb' tutorial prepared by the RaRe Technologies people:
Converting the entire corpus at the time of calling
corpus_transformed = model[corpus]
would mean storing the result in main memory, and that contradicts gensim's objective of memory-independence. If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
Hope it helps.
This has been answered, but here is some code for anyone looking to also export the classification of unseen documents to a CSV file.
# Access the unseen corpus
corpus_test = [id2word.doc2bow(doc) for doc in data_test_lemmatized]

# Transform into LDA space based on the old model
lda_unseen = lda_model[corpus_test]

# Print results and collect them for export
topic_probability = []
for t in lda_unseen:
    print(t)
    topic_probability.append(t)

results_test = pd.DataFrame(topic_probability,
                            columns=['Topic 1', 'Topic 2', 'Topic 3',
                                     'Topic 4', 'Topic 5', 'Topic n'])
results_test.to_csv('test_results.csv', index=True, header=True)
Code inspired from this post.
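One caveat with building the DataFrame directly: by default the model omits topics below its minimum_probability, so documents can yield rows of different lengths. A sketch that pads each row to a fixed topic count first (num_topics and the example pairs below are assumed):

```python
num_topics = 6  # assumed; should match the trained model
doc_topics = [(1, 0.6), (4, 0.4)]  # example (topic_id, probability) output

# Fill a fixed-width row, leaving omitted topics at probability 0.0.
row = [0.0] * num_topics
for topic_id, prob in doc_topics:
    row[topic_id] = prob
print(row)  # → [0.0, 0.6, 0.0, 0.0, 0.4, 0.0]
```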