 

Can we use a self made corpus for training for LDA using gensim?

Tags:

python

gensim

lda

I have to apply LDA (Latent Dirichlet Allocation) to get the possible topics from a data base of 20,000 documents that I collected.

How can I use these documents, rather than an existing corpus such as the Brown Corpus or English Wikipedia, as the training corpus?

You can refer to this page.

Animesh Pandey asked Apr 27 '13

People also ask

What are the two main inputs to an LDA topic model using Gensim?

The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus. Let's create them. Gensim creates a unique id for each word in the documents.

How do you train LDA?

In order to train an LDA model you need to provide a fixed, assumed number of topics across your corpus. There are a number of ways you could approach this: run LDA on your corpus with different numbers of topics and see whether the word distribution per topic looks sensible.

How many iterations does LDA have?

LDA uses a 4-step iterative process that produces better results as the number of iterations increases, because the topic probabilities are refined with each successive iteration of the algorithm.

What is LDA Gensim?

Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modeling with an excellent implementation in Python's Gensim package. The challenge, however, is how to extract good-quality topics that are clear, well separated and meaningful.


1 Answer

After going through the documentation of the Gensim package, I found that it supports four formats for serializing a corpus:

  1. Matrix Market (.mm)
  2. SVMlight (.svmlight)
  3. Blei LDA-C (.lda-c)
  4. GibbsLDA++ List-of-Words (.low)

In this problem, there are 19,188 documents in the database. One has to read each document and remove stopwords and punctuation from the sentences, which can be done using nltk.

import gensim
from gensim import corpora, similarities, models

## Text preprocessing (tokenisation, stopword and punctuation removal)
## is done here using nltk.
## final_text contains the token lists of all the documents.

dictionary = corpora.Dictionary(final_text)
dictionary.save('questions.dict')

## Convert each document to a bag-of-words vector and serialize the
## corpus in each of the four supported formats.
corpus = [dictionary.doc2bow(text) for text in final_text]
corpora.MmCorpus.serialize('questions.mm', corpus)
corpora.SvmLightCorpus.serialize('questions.svmlight', corpus)
corpora.BleiCorpus.serialize('questions.lda-c', corpus)
corpora.LowCorpus.serialize('questions.low', corpus)

## The dictionary and corpus can then be used to train an LDA model.
## update_every=0 selects batch (rather than online) training.
mm = corpora.MmCorpus('questions.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=dictionary,
                                      num_topics=100, update_every=0,
                                      chunksize=19188, passes=20)

In this way, one can transform a dataset into a corpus that can be used to train an LDA topic model with the gensim package.

Animesh Pandey answered Oct 12 '22