Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I include a python package with Hadoop streaming job?

Tags:

python

hadoop

I am trying include a python package (NLTK) with a Hadoop streaming job, but am not sure how to do this without including every file manually via the CLI argument, "-file".

Edit: One solution would be to install this package on all the slaves, but I don't have that option currently.

like image 494
Dolan Antenucci Avatar asked Jul 25 '11 03:07

Dolan Antenucci


People also ask

Can I use Python with Hadoop?

Compatibility with Hadoop and Spark: Hadoop framework is written in Java language; however, Hadoop programs can be coded in Python or C++ language.

What support is available in Python for Hadoop streaming?

Features of Hadoop Streaming Hadoop Streaming supports almost all types of programming languages such as Python, C++, Ruby, Perl etc. The entire Hadoop Streaming framework runs on Java. However, the codes might be written in different languages as mentioned in the above point.

How do I run a Python file in Hadoop?

To execute Python in Hadoop, we will need to use the Hadoop Streaming library to pipe the Python executable into the Java framework. As a result, we need to process the Python input from STDIN. Run ls and you should find mapper.py and reducer.py in the namenode container.


4 Answers

Just came across this gem of a solution: http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/

first create zip w/ the libraries desired

zip -r nltkandyaml.zip nltk yaml
mv ntlkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod

next, include via Hadoop stream "-file" argument:

hadoop -file nltkandyaml.zip

finally, load the libaries via python:

import zipimport
importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk') 

Additionally, this page summarizes how to include a corpus: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

download and unzip the wordnet corpus

cd wordnet
zip -r ../wordnet-flat.zip *

in python:

wn = WordNetCorpusReader(nltk.data.find('lib/wordnet-flat.zip'))
like image 98
Dolan Antenucci Avatar answered Sep 16 '22 16:09

Dolan Antenucci


I would zip up the package into a .tar.gz or a .zip and pass the entire tarball or archive in a -file option to your hadoop command. I've done this in the past with Perl but not Python.

That said, I would think this would still work for you if you use Python's zipimport at http://docs.python.org/library/zipimport.html, which allows you to import modules directly from a zip.

like image 20
Ray Toal Avatar answered Sep 18 '22 16:09

Ray Toal


You can use zip lib like this:

import sys
sys.path.insert(0, 'nltkandyaml.mod')
import ntlk
import yaml
like image 44
Jimmy Zhang Avatar answered Sep 19 '22 16:09

Jimmy Zhang


An example of loading external python package nltk
refer to the answer
Running extrnal python lib like (NLTK) with hadoop streaming
I followed following approach and ran the nltk package in with hadoop streaming successfully.

Assumption, you have already your package or (nltk in my case)in your system

first:

zip -r nltk.zip nltk
mv ntlk.zip /place/it/anywhere/you/like/nltk.mod

Why any where will work?
Ans :- Because we will provide path to this .mod zipped file through command line, we don't need to worry much about it.

second:
changes in your mapper or .py file

#Hadoop cannot unzip files by default thus you need to unzip it   
import zipimport
importer = zipimport.zipimporter('nltk.mod')
nltk = importer.load_module('nltk')

#now import what ever you like from nltk
from nltk import tree
from nltk import load_parser
from nltk.corpus import stopwords
nltk.data.path += ["."]

third: command line argument to run map-reduce

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /your/path/to/mapper/mapper.py \
-mapper '/usr/local/bin/python3.4 mapper.py' \
-file /your/path/to/reducer/reducer.py \
-reducer '/usr/local/bin/python3.4 reducer.py' \
-file /your/path/to/nltkzippedmodfile/nltk.mod \
-input /your/path/to/HDFS/input/check.txt -output /your/path/to/HDFS/output/

Thus, above step solved my problem and I think it should solve others as well.
cheers,

like image 41
Saurab Dulal Avatar answered Sep 16 '22 16:09

Saurab Dulal