Should I use a TF-IDF corpus or a plain bag-of-words corpus to infer document topics with LDA?

Tags: python, gensim, lda

I am wondering whether a TF-IDF corpus or a plain bag-of-words corpus should be used when inferring topics for documents with LDA in gensim.

Here is an example:

from gensim import corpora, models
import numpy.random
numpy.random.seed(10)

doc0 = [(0, 1), (1, 1)]
doc1 = [(0,1)] 
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]

corpus = [doc0,doc1,doc2,doc3]
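# corpora.Dictionary expects tokenized texts, not (id, count) pairs,
# so a placeholder id -> token mapping is used here instead.
dictionary = {0: 'term0', 1: 'term1'}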

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
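# .save() pickles only the lazy wrapper; serialize the actual contents
# in Matrix Market format and read them back.
corpora.MmCorpus.serialize('x.corpus_tfidf', corpus_tfidf)

corpus_tfidf = corpora.MmCorpus('x.corpus_tfidf')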

lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)

# Which one should I use?
corpus_lda = lda[corpus]           # this one
corpus_LDA = lda[corpus_tfidf]     # or this one?


# Write the inferred topic vectors to disk in Matrix Market format.
corpora.MmCorpus.serialize('x.corpus_lda', corpus_lda)

for i, j in enumerate(corpus_lda):
    print(j, corpus[i])

asked Nov 26 '14 at 11:11 by Nipun Alahakoon


1 Answer

According to a thread on the Gensim mailing list (the last post in particular), the standard procedure is to use a bag-of-words corpus, both for training and for inference. You can use a TF-IDF corpus instead, but it is unclear what effect this would have on the resulting topics.

answered Nov 09 '22 at 15:11 by MrFancypants