
Passing Term-Document Matrix to Gensim LDA Model

My term-document matrix is stored as a numpy matrix, and I have a dictionary that records which term each column of the matrix represents.

Is there any way I can easily pass these two into Gensim's LDA model?

import cPickle
import numpy as np
tdMatrix = np.load('tdmatrix.npy')
dictionary = cPickle.load(open('dictionary.p', 'rb'))  # stores the term represented by each column

Can I pass these somehow to gensim.models.ldamodel.LdaModel?

mle asked Oct 18 '25 15:10


2 Answers

To treat a 2D numpy (or even scipy.sparse.csc) array as a gensim corpus, use the built-in matutils.Scipy2Corpus function.
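To make the target format concrete, here is a minimal plain-Python sketch of the conversion such a wrapper performs: each document becomes a list of (term_id, count) pairs with the zeros dropped. It assumes rows are documents and columns are terms, as the comment in the question suggests. The helper `dense_to_bow` and the toy data are hypothetical and are not part of gensim.

```python
# Sketch: turn a dense count matrix into the bag-of-words layout gensim consumes.
# Assumption: rows are documents and columns are terms. Transpose first if your
# matrix is laid out the other way round.

def dense_to_bow(matrix):
    """Return one list of (term_id, count) pairs per document row."""
    corpus = []
    for row in matrix:
        # Keep only the non-zero entries, since the corpus format is sparse.
        corpus.append(
            [(term_id, count) for term_id, count in enumerate(row) if count > 0]
        )
    return corpus


# Hypothetical toy data standing in for tdMatrix and dictionary.
td_matrix = [
    [2, 0, 1],
    [0, 3, 0],
]
id2word = {0: "linux", 1: "git", 2: "mysql"}

corpus = dense_to_bow(td_matrix)
print(corpus)  # [[(0, 2), (2, 1)], [(1, 3)]]

# With gensim installed, a corpus in this shape could then be handed to LDA:
# lda = gensim.models.LdaModel(corpus, id2word=id2word, num_topics=2)
```

The same loop also works on a 2D numpy array, because iterating over it yields one row at a time. In practice the built-in wrapper is preferable for large matrices, since it streams documents rather than building the whole list in memory.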

Radim answered Oct 21 '25 16:10


I believe Gensim uses much the same structure to represent a bag-of-words corpus, but I don't think a plain dictionary or numpy array would be directly compatible. Gensim's API lists a few corpus readers that accommodate various formats, but those seem to be built for importing data from other toolkits. So in your case the easiest solution may be to reconstruct the documents from your matrix and dictionary as lists of separate strings, convert those lists to Gensim's bag-of-words corpus, and finally pass that corpus to LDA as shown in the tutorials.

This approach has the added benefit that you can apply Gensim's preprocessing functions and filter words with low/high frequencies.
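The reconstruction step could look roughly like the sketch below. It assumes rows are documents, columns are terms, and the dictionary maps a column index to its term. The function `rebuild_documents` and the toy data are hypothetical names used only for illustration.

```python
# Sketch: rebuild token lists from a count matrix and a column-to-term mapping.
# Assumption: rows are documents, columns are terms, counts are non-negative
# integers, and id2word[column_index] gives the term for that column.

def rebuild_documents(matrix, id2word):
    """Return one list of tokens per document, repeating each term by its count."""
    docs = []
    for row in matrix:
        tokens = []
        for term_id, count in enumerate(row):
            # Repeat the term as many times as it occurs in the document.
            tokens.extend([id2word[term_id]] * int(count))
        docs.append(tokens)
    return docs


# Hypothetical toy data standing in for tdMatrix and dictionary.
td_matrix = [
    [2, 0, 1],
    [0, 3, 0],
]
id2word = {0: "linux", 1: "git", 2: "mysql"}

docs = rebuild_documents(td_matrix, id2word)
print(docs)  # [['linux', 'linux', 'mysql'], ['git', 'git', 'git']]
```

The original word order is lost, which does not matter for a bag-of-words model such as LDA. With gensim installed, these token lists can be fed to `gensim.corpora.Dictionary(docs)`, optionally trimmed with its `filter_extremes` method, and converted to a corpus by calling `doc2bow` on each document.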

MrFancypants answered Oct 21 '25 16:10


