 

Are there any distributed machine learning libraries for using Python with Hadoop? [closed]

I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past and I do not know Java.

As far as I can tell, there are no well-developed Python libraries for distributed machine learning. Java, on the other hand, has Apache Mahout and the more recent Oryx from Cloudera.

Essentially it seems I have to choose between two options: slog through parallelising my own algorithms to use with Hadoop Streaming or one of the Python wrappers for Hadoop until decent libraries exist, or jump ship to Java so that I can use Mahout/Oryx. There is a world of difference between writing your own MapReduce word-count code and writing your own MapReduce SVM! Even with the help of great tutorials like this one.

I don't know which is the wiser choice, so my question is:

A) Is there some Python library I have missed which would be useful? If not, do you know if there are any in development which will be useful in the near future?

B) If the answer to the above is no then would my time be better spent jumping ship to Java?

asked Jan 09 '13 by iRoygbiv

People also ask

Can I use Python with Hadoop?

Hadoop Streaming is a feature that comes with Hadoop and allows users or developers to use various different languages for writing MapReduce programs like Python, C++, Ruby, etc. It supports all the languages that can read from standard input and write to standard output.
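To make the Streaming model concrete, here is a minimal word-count sketch in Python. The mapper and reducer are written as generator functions so they can be tested locally; in a real job each half would live in its own script reading STDIN and printing to STDOUT, and the `hadoop jar` command in the comment is a placeholder rather than an exact command line:

```python
# Word-count sketch in the Hadoop Streaming style.
# In a real job, mapper and reducer each live in their own script and are
# launched with something like (paths are placeholders):
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input in/ -output out/
from itertools import groupby

def mapper(lines):
    # Map phase: emit one tab-separated (word, 1) pair per word.
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(pairs):
    # Reduce phase: Hadoop delivers pairs sorted by key, so consecutive
    # lines with the same word can be summed with groupby.
    rows = (p.split("\t") for p in pairs)
    for word, group in groupby(rows, key=lambda r: r[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

Note that the reducer relies on Hadoop's sort-and-shuffle step: it only works because all pairs with the same key arrive consecutively.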

Which is the machine learning Library in Hadoop?

1. MLlib: No conversation about deep learning tools should begin without mention of Apache's own open-source machine learning library for Spark and Hadoop. MLlib features a host of common algorithms and data types, all designed to run at speed and scale.

How do I run Python in Hadoop?

To execute Python in Hadoop, we will need to use the Hadoop Streaming utility to pipe data between the Java framework and our Python scripts. As a result, our Python scripts read their input from STDIN. Run ls and you should find mapper.py and reducer.py in the namenode container.

What is MapReduce in Python?

MapReduce is a programming model that enables large volumes of data to be processed and generated by dividing work into independent tasks and executing the tasks in parallel across a cluster of machines.
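The model can be illustrated without Hadoop at all. The sketch below runs the map, shuffle, and reduce phases in-process on a toy word-count input; all names are illustrative, and on a real cluster the map and reduce calls would run in parallel across machines:

```python
from collections import defaultdict

def map_phase(chunk):
    # Each map task processes its chunk of the input independently,
    # emitting (key, value) pairs.
    return [(word, 1) for line in chunk for word in line.split()]

def shuffle(mapped_chunks):
    # Between the phases, the framework groups all emitted values by key.
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Each reduce task aggregates the values for one key.
    return key, sum(values)

chunks = [["the cat sat"], ["the dog sat on the mat"]]
mapped = [map_phase(c) for c in chunks]   # independent tasks, parallel on a cluster
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'the': 3, 'cat': 1, 'sat': 2, 'dog': 1, 'on': 1, 'mat': 1}
```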


1 Answer

I do not know of any library that could be used natively in Python for machine learning on Hadoop, but an easy solution would be to use the jpype module, which basically allows you to interact with Java from within your Python code.

You can for example start a JVM like this:

from jpype import startJVM, getDefaultJVMPath, isJVMStarted

def start_jpype(classpath):
    # Start the JVM once, with the required jars on the class path.
    if not isJVMStarted():
        cpopt = "-Djava.class.path={cp}".format(cp=classpath)
        startJVM(getDefaultJVMPath(), "-ea", cpopt)

There is a very good tutorial on the topic here, which explains how to use KMeans clustering from your Python code using Mahout.

answered Sep 27 '22 by Charles Menguy