
Java or Python distributed compute job (on a student budget)?

I have a large dataset (c. 40 GB) that I want to run some NLP over (largely embarrassingly parallel), using a couple of computers in the lab on which I do not have root access and have only 1 GB of user space. I experimented with Hadoop, but that was dead in the water: the data is stored on an external USB hard drive, and I can't load it onto the DFS because of the 1 GB user space cap. I have been looking into a couple of Python-based options (I'd rather use NLTK than Java's LingPipe if I can help it), and the distributed compute options seem to be:

  • IPython
  • Disco

After my Hadoop experience, I want to make an informed choice. Any help on which would be more appropriate would be greatly appreciated.

Amazon's EC2 and similar services are not really an option, as I have next to no budget.

midget_sadhu asked May 16 '10 14:05



2 Answers

Speak with the IT department at your school (especially if you are in college). If it is for an assignment or research, I bet they would be more than happy to give you more disk space.

swanson answered Sep 20 '22 14:09



No actual answers here; I'd have put this as a comment, but on this site you're forced to post an answer if you're still a noob.

If it's genuinely as parallel as that, and it's only a couple of computers, could you not split the dataset up manually ahead of time?
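To make the manual split concrete, here is a minimal sketch, assuming the dataset is a directory of many files (the question doesn't say how the 40 GB is laid out). The helper names `list_files` and `assign_shards` are made up for illustration. It assigns files to machines greedily by size so each machine gets roughly the same number of bytes.

```python
from pathlib import Path


def list_files(root):
    """Return (path, size_in_bytes) for every regular file under root."""
    return [(str(p), p.stat().st_size) for p in Path(root).rglob("*") if p.is_file()]


def assign_shards(files, num_workers):
    """Greedy size-balanced split: take files largest first and give each
    one to whichever shard currently holds the fewest bytes."""
    shards = [[] for _ in range(num_workers)]
    loads = [0] * num_workers
    for path, size in sorted(files, key=lambda item: item[1], reverse=True):
        lightest = loads.index(min(loads))
        shards[lightest].append(path)
        loads[lightest] += size
    return shards, loads
```

Each machine would then process only the paths in its own shard (for example, read from a text file written once from the shard list), so no coordination framework is needed as long as the per-file work really is independent.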

Have you confirmed that there isn't going to be a firewall or similar stopping you from using something like that anyway?

You may only have 1 GB of user space, but if it's Linux, what about /tmp? (If Windows, what about %temp%?)
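A quick way to check whether the temp directory is worth using is to ask how much free space it has. This is only a sketch: `scratch_space_gib` is a made-up helper name, and it relies on the standard library's `tempfile.gettempdir()` (which resolves to something like /tmp on Linux or %TEMP% on Windows) and `shutil.disk_usage()`. It reports free space on the filesystem, not any per-user quota the admins may have set.

```python
import shutil
import tempfile


def scratch_space_gib(path=None):
    """Return (directory, free space in GiB) for the given path,
    defaulting to the platform's temp directory."""
    path = path or tempfile.gettempdir()
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return path, usage.free / 2**30
```

Bear in mind that temp directories are often cleared on reboot or by cleanup jobs, so anything stored there should be treated as a disposable working copy, with the external drive remaining the master.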

frymaster answered Sep 18 '22 14:09
