
gensim.interfaces.TransformedCorpus - How do I use it?

Tags:

gensim

lda

I'm relatively new to the world of Latent Dirichlet Allocation. I was able to generate an LDA model by following the Wikipedia tutorial, and I can also generate an LDA model from my own documents. My next step is to understand how I can use a previously generated model to classify unseen documents. I'm saving my "lda_wiki_model" with:

    id2word = gensim.corpora.Dictionary.load_from_text('ptwiki_wordids.txt.bz2')
    mm = gensim.corpora.MmCorpus('ptwiki_tfidf.mm')

    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
    lda.save('lda_wiki_model.lda')

And I'm loading the same model with:

    new_lda = gensim.models.LdaModel.load(path + 'lda_wiki_model.lda')  # load the model

I have a "new_doc.txt". I built an id <-> term dictionary from this document and converted the tokenized document into a document-term matrix.

But when I run new_topics = new_lda[corpus], I get a 'gensim.interfaces.TransformedCorpus object at 0x7f0ecfa69d50'.

How can I extract topics from that?

I already tried

    lsa = models.LdaModel(new_topics, id2word=dictionary, num_topics=1, passes=2)
    corpus_lda = lsa[new_topics]
    print(lsa.print_topics(num_topics=1, num_words=7))

and

    print(corpus_lda.print_topics(num_topics=1, num_words=7))

but that returns topics unrelated to my new document. Where is my mistake? Am I misunderstanding something?

If I train a new model using the dictionary and corpus created above, I get the correct topics. My question is: how do I reuse my existing model? Is it correct to reuse that wiki_model this way?

Thank you.

Marco Oliveira asked Jul 26 '17 03:07



3 Answers

I was facing the same problem. This code should solve it:

    new_topics = new_lda[corpus]

    for topic in new_topics:
        print(topic)

For each document in the corpus, this prints a list of tuples of the form (topic number, probability).
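
Each item produced by the loop above is one document's topic distribution. As a minimal sketch of how such a list might be consumed, the helper below picks the most probable topic for a document. The function name `dominant_topic` and the sample values are made up for illustration; real output from `new_lda[corpus]` will contain different numbers.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability.

    doc_topics is a list of (topic_id, probability) tuples, the shape
    described above for a single document. Returns None for an empty list.
    """
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])


# Hypothetical output for two documents (values invented for illustration)
sample_topics = [
    [(3, 0.62), (17, 0.25), (42, 0.13)],
    [(8, 0.91)],
]

for doc_topics in sample_topics:
    topic_id, probability = dominant_topic(doc_topics)
    print(f"dominant topic: {topic_id} (p={probability:.2f})")
```

With a real model, the topic number can then be passed to the model's `print_topic` method to see the words behind it.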

Lavanya answered Oct 24 '22 08:10


From the 'Topics_and_Transformation.ipynb' tutorial prepared by the RaRe Technologies people:

Converting the entire corpus at the time of calling corpus_transformed = model[corpus] would mean storing the result in main memory, and that contradicts gensim’s objective of memory-independence.

If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
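
To illustrate why `model[corpus]` hands back an object rather than a list of results, here is a toy stand-in for a lazily transformed corpus. It is not gensim's actual implementation; the class `LazyTransformed` and the `double_counts` transformation are hypothetical names used only for this sketch.

```python
class LazyTransformed:
    """Toy stand-in for a transformed corpus: nothing is computed until iteration."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        # Each document is transformed on the fly and never stored
        for doc in self.corpus:
            yield self.transform(doc)


calls = []


def double_counts(bow):
    """Hypothetical transformation: records each call and doubles the counts."""
    calls.append(bow)
    return [(term_id, 2 * count) for term_id, count in bow]


corpus = [[(0, 1), (1, 2)], [(1, 1)]]
lazy = LazyTransformed(double_counts, corpus)
assert calls == []  # creating the wrapper did no work yet

first_pass = list(lazy)
second_pass = list(lazy)  # iterating again recomputes every document
print(first_pass)
print(len(calls))
```

Because every pass repeats the transformation, writing the transformed corpus to disk once (for example with `gensim.corpora.MmCorpus.serialize`) and reading it back can pay off when the transformation is expensive.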

Hope it helps.

simone answered Oct 24 '22 08:10


This has been answered, but here is some code for anyone looking to also export the classification of unseen documents to a CSV file.

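    import pandas as pd

    # Convert the unseen, already tokenized documents to bag-of-words,
    # using the same dictionary (id2word) the model was trained with
    corpus_test = [id2word.doc2bow(doc) for doc in data_test_lemmatized]

    # Transform into LDA space using the previously trained model
    lda_unseen = lda_model[corpus_test]

    # Print results and collect one {topic id: probability} row per document
    topic_probability = []
    for doc_topics in lda_unseen:
        print(doc_topics)
        topic_probability.append(dict(doc_topics))

    # Topics below the model's minimum probability are omitted per document,
    # so fill the gaps with 0 and label the columns by topic id
    results_test = pd.DataFrame(topic_probability).fillna(0)
    results_test = results_test.sort_index(axis=1).add_prefix('Topic ')

    results_test.to_csv('test_results.csv', index=True, header=True)

Code inspired by this post.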

Anavir answered Oct 24 '22 09:10