I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past, and I do not know Java.
As far as I can tell there are no well-developed Python libraries for distributed machine learning. Java, on the other hand, has Apache Mahout and the more recent Oryx from Cloudera.
Essentially it seems I have to choose between two options: slog through parallelising my own algorithms to use with Hadoop Streaming or one of the Python wrappers for Hadoop until decent libraries exist, or jump ship to Java so that I can use Mahout/Oryx. There is a world of difference between writing your own MapReduce word-count code and writing your own MapReduce SVM, even with the help of great tutorials like this.
I don't know which is the wiser choice, so my question is:
A) Is there some Python library I have missed which would be useful? If not, do you know if there are any in development which will be useful in the near future?
B) If the answer to the above is no then would my time be better spent jumping ship to Java?
Hadoop Streaming is a feature that ships with Hadoop and lets developers write MapReduce programs in languages other than Java, such as Python, C++, or Ruby. It supports any language that can read from standard input and write to standard output.
MLlib: no conversation about distributed machine learning tools should go far without mention of Apache Spark's own open-source machine learning library, which also runs on Hadoop clusters via YARN. MLlib features a host of common algorithms and data types, all designed to run at speed and scale.
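MLlib is also reachable from Python through PySpark, which makes it directly relevant to question A. A minimal sketch, assuming Spark is available on the cluster (the app name and the toy data below are purely illustrative):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-sketch")

# Toy 2-D points; in practice you would load data from S3 or HDFS.
data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# Train a 2-cluster model; the work is distributed across the cluster.
model = KMeans.train(data, k=2, maxIterations=10)
print(model.clusterCenters)

sc.stop()

Assuming a Spark installation, such a script would be submitted with spark-submit rather than through Hadoop Streaming.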
To execute Python on Hadoop, we need Hadoop Streaming to pipe data between the Java framework and our Python scripts. As a result, the Python code has to read its input from STDIN and write its output to STDOUT. Run ls and you should find mapper.py and reducer.py in the namenode container.
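As a minimal sketch of what those two files might contain (a word count, assuming nothing beyond the STDIN/STDOUT contract described above):

#!/usr/bin/env python
# mapper.py - read lines from STDIN, emit "word<TAB>1" pairs on STDOUT.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so counts can be accumulated per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Both scripts are then handed to the hadoop-streaming jar via its -mapper and -reducer options; the exact jar location depends on your Hadoop distribution.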
MapReduce is a programming model that enables large volumes of data to be processed and generated by dividing work into independent tasks and executing the tasks in parallel across a cluster of machines.
I do not know of any library that could be used natively in Python for machine learning on Hadoop, but an easy solution would be to use the jpype
module, which basically allows you to interact with Java from within your Python code.
You can, for example, start a JVM like this:
from jpype import startJVM, getDefaultJVMPath

# Path to the JVM shared library and the Java classpath (e.g. the Mahout jars);
# adjust the classpath to your installation.
jvmlib = getDefaultJVMPath()
classpath = "/path/to/mahout/libs/*"

jvm = None

def start_jpype():
    global jvm
    if jvm is None:
        cpopt = "-Djava.class.path={cp}".format(cp=classpath)
        startJVM(jvmlib, "-ea", cpopt)
        jvm = "started"
There is a very good tutorial on the topic here, which explains how to use KMeans clustering from your Python code using Mahout.
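As a rough idea of how that fits together: once start_jpype() has been called with the Mahout jars on the classpath, Java classes can be looked up by their fully qualified names. The class name below is an assumption and should be checked against your Mahout version:

from jpype import JClass

start_jpype()

# Assumed class name -- verify it against the Mahout release on your classpath.
KMeansDriver = JClass("org.apache.mahout.clustering.kmeans.KMeansDriver")
# The driver's static methods can then be invoked like ordinary Python calls,
# with whatever arguments your Mahout version documents.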