 

Mallet topic modelling

I have been using MALLET to infer topics for a text file containing 100,000 lines (around 34 MB in MALLET format). Now I need to run it on a file containing a million lines (around 180 MB), and I am getting a java.lang.OutOfMemoryError. Is there a way to split the file into smaller ones and build a single model from the data in all of the files combined? Thanks in advance.

fayaz asked Mar 02 '11 13:03



2 Answers

In bin/mallet.bat, increase the value on this line:

set MALLET_MEMORY=1G
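
For example, the edited line might look like the sketch below. The value 4G is only an illustrative assumption, not a recommendation from MALLET itself; pick something that fits within the physical RAM of your machine. Values much above 1G generally also assume a 64-bit JVM. As far as I can tell, the script hands this setting to Java as the maximum heap size, which is why raising it can avoid the out-of-memory error.

```
rem bin/mallet.bat: raise the memory limit from the default 1G
rem (4G is an assumed example value; adjust to your available RAM)
set MALLET_MEMORY=4G
```

If the error persists with the larger value, the corpus may simply need more heap than the machine can provide.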
metdos answered Sep 22 '22 16:09



I'm not sure how well MALLET scales to big data, but the project at http://dragon.ischool.drexel.edu/ can keep its data in disk-backed storage, so it can scale to corpora of essentially unlimited size (at the cost of lower performance, of course).

yura answered Sep 21 '22 16:09
