I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past, and I do not know Java.
As far as I can tell there are no well-developed Python libraries for distributed machine learning. Java, on the other hand, has Apache Mahout and the more recent Oryx from Cloudera.
Essentially it seems I have to choose between two options: slog through parallelising my own algorithms to use with Hadoop Streaming or one of the Python wrappers for Hadoop until decent libraries exist, or jump ship to Java so that I can use Mahout/Oryx. There is a world of difference between writing your own MapReduce word-count code and writing your own MapReduce SVM, even with the help of great tutorials like this.
I don't know which is the wiser choice, so my question is:
A) Is there some Python library I have missed which would be useful? If not, do you know if there are any in development which will be useful in the near future?
B) If the answer to the above is no then would my time be better spent jumping ship to Java?
Hadoop Streaming is a feature that ships with Hadoop and lets developers write MapReduce programs in languages other than Java, such as Python, C++, or Ruby. It supports any language that can read from standard input and write to standard output.
MLlib: no conversation about distributed machine learning tools should go far without mention of Apache Spark's own open-source machine learning library, which also runs on Hadoop clusters via YARN. MLlib features a host of common algorithms and data types, all designed to run at speed and scale.
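MLlib is also reachable from Python through PySpark, which makes it directly relevant to question A. A minimal sketch, assuming Spark is available on the cluster (the app name and the toy data below are purely illustrative):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-sketch")

# Toy 2-D points; in practice you would load data from S3 or HDFS.
data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# Train a 2-cluster model; the work is distributed across the cluster.
model = KMeans.train(data, k=2, maxIterations=10)
print(model.clusterCenters)

sc.stop()

Assuming a Spark installation, such a script would be submitted with spark-submit rather than through Hadoop Streaming.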
To execute Python on Hadoop, we need Hadoop Streaming to pipe data between the Java framework and our Python scripts. As a result, the Python code has to read its input from STDIN and write its output to STDOUT. Run ls and you should find mapper.py and reducer.py in the namenode container.
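As a minimal sketch of what those two files might contain (a word count, assuming nothing beyond the STDIN/STDOUT contract described above):

#!/usr/bin/env python
# mapper.py - read lines from STDIN, emit "word<TAB>1" pairs on STDOUT.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so counts can be accumulated per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Both scripts are then handed to the hadoop-streaming jar via its -mapper and -reducer options; the exact jar location depends on your Hadoop distribution.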
MapReduce is a programming model that enables large volumes of data to be processed and generated by dividing work into independent tasks and executing the tasks in parallel across a cluster of machines.
I do not know of any library that could be used natively in Python for machine learning on Hadoop, but an easy solution would be to use the jpype
module, which basically allows you to interact with Java from within your Python code.
You can, for example, start a JVM like this:
from jpype import startJVM, getDefaultJVMPath

# Path to the JVM shared library and the Java classpath (e.g. the Mahout jars);
# adjust the classpath to your installation.
jvmlib = getDefaultJVMPath()
classpath = "/path/to/mahout/libs/*"

jvm = None

def start_jpype():
    global jvm
    if jvm is None:
        cpopt = "-Djava.class.path={cp}".format(cp=classpath)
        startJVM(jvmlib, "-ea", cpopt)
        jvm = "started"
There is a very good tutorial on the topic here, which explains how to use KMeans clustering from your Python code using Mahout.
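As a rough idea of how that fits together: once start_jpype() has been called with the Mahout jars on the classpath, Java classes can be looked up by their fully qualified names. The class name below is an assumption and should be checked against your Mahout version:

from jpype import JClass

start_jpype()

# Assumed class name -- verify it against the Mahout release on your classpath.
KMeansDriver = JClass("org.apache.mahout.clustering.kmeans.KMeansDriver")
# The driver's static methods can then be invoked like ordinary Python calls,
# with whatever arguments your Mahout version documents.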