Can we use a self made corpus for training for LDA using gensim?

1 Answer

After going through the documentation of the Gensim package, I found that a text repository can be transformed into a corpus and stored on disk in any of the following four formats:

Matrix Market (.mm)
SVMlight (.svmlight)
Blei's LDA-C format (.lda-c)
GibbsLDA++ list-of-words format (.low)
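To give an idea of what such a file contains, here is a minimal pure-Python sketch of the Matrix Market coordinate layout, with documents as rows and term ids as columns. The helper name to_matrix_market and the toy corpus are made up for illustration; this is not gensim's own writer, which should be used for real data.

```python
def to_matrix_market(corpus, num_terms):
    """Render a bag-of-words corpus as Matrix Market coordinate text.

    corpus is a list of documents, each a list of (term_id, weight) pairs
    with 0-based term ids. Matrix Market indices are 1-based.
    """
    entries = []
    for doc_no, doc in enumerate(corpus, start=1):
        for term_id, weight in sorted(doc):
            entries.append(f"{doc_no} {term_id + 1} {weight}")
    header = "%%MatrixMarket matrix coordinate real general"
    # Size line: number of documents, number of terms, number of non-zero entries.
    size_line = f"{len(corpus)} {num_terms} {len(entries)}"
    return "\n".join([header, size_line] + entries) + "\n"


# Hypothetical toy corpus: two documents over a three-term vocabulary.
toy_corpus = [[(0, 1), (2, 2)], [(1, 1)]]
mm_text = to_matrix_market(toy_corpus, num_terms=3)
print(mm_text)
```

Each line after the header and the size line is one non-zero entry: document number, term number, and the term's count in that document.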

In this problem, the database holds 19,188 documents in total. Each document has to be read and tokenized, and stopwords and punctuation have to be removed from its sentences, which can be done using nltk.
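The preprocessing step could look roughly like the sketch below. It assumes the NLTK stopword list has been downloaded with nltk.download('stopwords') and falls back to a tiny hand-written list otherwise. The names preprocess and raw_documents are hypothetical, and the regex tokenizer is a crude stand-in for a proper one.

```python
import re

try:
    # Assumption: the NLTK stopword data is installed locally.
    from nltk.corpus import stopwords
    STOPWORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    # Tiny stand-in list so the sketch still runs without NLTK or its data.
    STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "with"}


def preprocess(document):
    """Lowercase a document, drop punctuation, and remove stopwords."""
    # Keeping only runs of letters and digits discards punctuation.
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return [token for token in tokens if token not in STOPWORDS]


# Hypothetical documents standing in for the real database.
raw_documents = [
    "The corpus is built with Gensim, and LDA!",
    "Topics are found in the documents.",
]
final_text = [preprocess(doc) for doc in raw_documents]
print(final_text)
```

The result is a list of token lists, one per document, which is the shape that corpora.Dictionary expects.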

import gensim
from gensim import corpora

# Text preprocessing (tokenizing, removing stopwords and punctuation)
# is done here using nltk.

# final_text contains the tokens of all the documents:
# a list of token lists, one per document.

# Build the token-to-id dictionary and save it.
dictionary = corpora.Dictionary(final_text)
dictionary.save('questions.dict')

# Convert each document to a bag-of-words vector, then save the corpus
# in each of the four formats.
corpus = [dictionary.doc2bow(text) for text in final_text]
corpora.MmCorpus.serialize('questions.mm', corpus)
corpora.SvmLightCorpus.serialize('questions.svmlight', corpus)
corpora.BleiCorpus.serialize('questions.lda-c', corpus)
corpora.LowCorpus.serialize('questions.low', corpus)

# The dictionary and corpus can then be used to train LDA.
# update_every=0 selects batch learning; chunksize is the number of documents.

mm = corpora.MmCorpus('questions.mm')
lda = gensim.models.ldamodel.LdaModel(
    corpus=mm,
    id2word=dictionary,
    num_topics=100,
    update_every=0,
    chunksize=19188,
    passes=20,
)

This way, a dataset can be transformed into a corpus on which an LDA topic model can be trained using the gensim package.
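If the bag-of-words representation is unfamiliar, the pure-Python sketch below mimics what the dictionary and doc2bow steps produce: every distinct token gets an integer id, and each document becomes a sorted list of (token id, count) pairs. The helper names are hypothetical and the id assignment here (order of first appearance) is a simplification, not gensim's implementation.

```python
from collections import Counter


def build_token_ids(tokenized_docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bow(doc, token_to_id):
    """Turn one tokenized document into sorted (token_id, count) pairs."""
    # Tokens missing from the mapping are ignored.
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())


# Hypothetical toy input in the same shape as final_text.
toy_text = [["corpus", "lda", "corpus"], ["lda", "topic"]]
token_to_id = build_token_ids(toy_text)
toy_bow = [to_bow(doc, token_to_id) for doc in toy_text]
print(token_to_id)
print(toy_bow)
```

Word order is discarded in this representation; only how often each token occurs in a document is kept, which is all that LDA needs.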

answered Oct 12 '22 14:10

Animesh Pandey

Tags:

python

gensim

lda
