 

Topic Modeling tool for large data set (30GB)

I'm looking for a topic modeling tool that can handle a large data set.

My current training data set is 30 GB. I tried MALLET topic modeling, but I always got an OutOfMemoryError.

If you have any tips, please let me know.

Benben asked Oct 01 '22
1 Answer

There are many options available to you, and this response is agnostic as to how they compare.

I think that the important thing with such a large dataset is the method of approximate posterior inference used, not necessarily the software implementation. According to this paper, online variational Bayes inference is much more efficient, in terms of both time and space, than Gibbs sampling. Though I've never used it, the gensim package looks good. It's written in Python, and there are in-depth tutorials on the project's webpage.
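To make that concrete, here is a minimal sketch of the streaming pattern that makes gensim practical at this scale. Everything specific in it is an assumption for illustration: the file name corpus.txt, the one-document-per-line whitespace-tokenized format, and the parameter values. The key idea is that the corpus is an iterable yielding one bag-of-words vector at a time, so the 30 GB file never has to fit in memory, and a finite chunksize with update_every=1 gives the online (mini-batch) variational Bayes updates rather than full batch passes.

    # Minimal sketch of online LDA with gensim.
    # Assumptions: corpus.txt exists, one document per line,
    # whitespace tokenization is good enough for illustration.
    from gensim import corpora, models

    class StreamedCorpus:
        """Yields one bag-of-words vector at a time, so the full
        corpus never has to be loaded into memory."""
        def __init__(self, path, dictionary):
            self.path = path
            self.dictionary = dictionary

        def __iter__(self):
            with open(self.path) as f:
                for line in f:
                    yield self.dictionary.doc2bow(line.lower().split())

    # Build the vocabulary in a single streaming pass over the file.
    dictionary = corpora.Dictionary(
        line.lower().split() for line in open('corpus.txt'))
    dictionary.filter_extremes(no_below=5, no_above=0.5)  # prune rare/ubiquitous terms

    corpus = StreamedCorpus('corpus.txt', dictionary)

    # chunksize sets the mini-batch size; update_every=1 updates the
    # model after each mini-batch (online learning) instead of once per pass.
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100,
                          chunksize=10000, update_every=1, passes=1)

    for topic in lda.print_topics(num_topics=5):
        print(topic)

With this setup, memory use is dominated by the vocabulary and the topic-word matrix rather than the corpus itself; for even bigger jobs, gensim's LdaModel also accepts a distributed=True flag for spreading training across machines.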

For code that comes straight from the source, see the webpage of David Blei, one of the authors of the LDA model. He links to more than a few implementations in a variety of languages (R, Java, and C++).

sinwav answered Nov 10 '22