I'm looking for a topic modeling tool that can handle a large data set.
My current training set is 30 GB. I tried MALLET's topic modeling, but I always get an OutOfMemoryError.
If you have any tips, please let me know.
There are many options available to you; this answer doesn't attempt to compare them.
I think the important thing with such a large dataset is the method of approximate posterior inference, not necessarily the software implementation. According to this paper, online variational Bayes inference is much more efficient, in both time and memory, than Gibbs sampling, because it processes the corpus in mini-batches rather than holding it all at once. Though I've never used it, the gensim package looks good. It's in Python, and there are in-depth tutorials on the project's webpage.
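If you try gensim, a minimal sketch of streaming online LDA might look like the following. I'm assuming a hypothetical file `corpus.txt` with one document per line; the file name and all parameter values (`num_topics`, `chunksize`, etc.) are illustrative, not recommendations. The point is that the corpus is iterated lazily, so the 30 GB never has to fit in RAM:

```
from gensim import corpora, models

class LineCorpus:
    """Stream documents one at a time instead of loading the whole file."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield self.dictionary.doc2bow(line.lower().split())

# Build the vocabulary in a single streaming pass over the file.
dictionary = corpora.Dictionary(
    line.lower().split() for line in open('corpus.txt')
)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop very rare/common tokens

corpus = LineCorpus('corpus.txt', dictionary)

# Online variational Bayes: chunksize sets the mini-batch size,
# update_every=1 triggers an online update after each chunk.
lda = models.LdaModel(
    corpus,
    id2word=dictionary,
    num_topics=100,
    update_every=1,
    chunksize=10000,
    passes=1,
)
print(lda.print_topics(5))
```

Because both the dictionary construction and the model update only ever see one chunk of documents at a time, peak memory is driven by the vocabulary and topic matrices rather than the corpus size.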
For code that comes straight from the source, see the webpage of David Blei, one of the authors of the LDA model, here. He links to quite a few implementations in a variety of languages (R, Java, C++).