I am trying to include a Python package (NLTK) with a Hadoop streaming job, but I'm not sure how to do this without including every file manually via the CLI argument "-file".
Edit: One solution would be to install this package on all the slaves, but I don't have that option currently.
Compatibility with Hadoop: the Hadoop framework itself is written in Java, but Hadoop programs can be written in other languages such as Python or C++.
Features of Hadoop Streaming: Hadoop Streaming supports almost any programming language, such as Python, C++, Ruby, or Perl. The streaming framework itself still runs on Java; only the mapper and reducer code is written in the other language.
To run Python on Hadoop, we use the Hadoop Streaming library to pipe data between the Java framework and the Python executables, so a mapper.py and a reducer.py simply read their input from STDIN and write their output to STDOUT.
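For context, here is a minimal word-count mapper and reducer sketch (my own illustration, not part of the original question) showing the STDIN/STDOUT contract that Hadoop Streaming expects:
# mapper.py - reads raw lines from STDIN, emits "word<TAB>1" pairs on STDOUT
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py - input arrives sorted by key, so sum consecutive counts per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))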
Just came across this gem of a solution: http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
first, create a zip with the libraries you need:
zip -r nltkandyaml.zip nltk yaml
mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod
next, include it via the Hadoop Streaming "-file" argument:
hadoop -file nltkandyaml.mod
finally, load the libraries via Python:
import zipimport
importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')
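Once loaded this way, the modules behave like ordinary imports, and further imports from inside the packages are resolved from the archive (the later answer below relies on the same behaviour). A small usage sketch, my own illustration; the RegexpTokenizer is just an example of a call that needs no separate NLTK data files:
import sys
import zipimport

importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')

# submodules can now be imported from the archive as usual
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')         # regex tokenizer, needs no corpus data
config = yaml.safe_load("min_length: 3")    # tiny YAML snippet, just to show yaml works

for line in sys.stdin:
    for token in tokenizer.tokenize(line):
        if len(token) >= config["min_length"]:
            print("%s\t1" % token.lower())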
Additionally, this page summarizes how to include a corpus: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/
download and unzip the wordnet corpus
cd wordnet
zip -r ../wordnet-flat.zip *
in python (the import location and constructor arguments may vary slightly by NLTK version):
from nltk.corpus.reader import WordNetCorpusReader
wn = WordNetCorpusReader(nltk.data.find('lib/wordnet-flat.zip'))
I would zip up the package into a .tar.gz or a .zip and pass the entire tarball or archive in a -file option to your hadoop command. I've done this in the past with Perl but not Python.
That said, I would think this would still work for you if you use Python's zipimport (http://docs.python.org/library/zipimport.html), which allows you to import modules directly from a zip.
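One caveat worth noting: zipimport only reads zip archives, so if you go the .tar.gz route you have to unpack the tarball yourself at the start of the mapper. A rough sketch, assuming a hypothetical archive name nltk.tar.gz shipped via -file:
import sys
import tarfile

# zipimport cannot read tarballs, so unpack the shipped archive into the task's
# working directory and put that directory on the import path instead
tarfile.open('nltk.tar.gz', 'r:gz').extractall('.')
sys.path.insert(0, '.')

import nltk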
You can also put the zip file directly on sys.path, like this:
import sys
sys.path.insert(0, 'nltkandyaml.mod')
import nltk
import yaml
For an example of loading an external Python package such as nltk, refer to this answer:
Running external python lib like (NLTK) with hadoop streaming
I followed the approach below and ran the nltk package with Hadoop streaming successfully.
Assumption: you already have your package (nltk, in my case) on your system.
first:
zip -r nltk.zip nltk
mv nltk.zip /place/it/anywhere/you/like/nltk.mod
Why will any location work?
Answer: because we will provide the path to this .mod zipped file on the command line, so we don't need to worry much about where it lives.
second:
make these changes in your mapper .py file
# Hadoop does not unzip shipped files by default, so load the package straight from the zipped .mod file
import zipimport
importer = zipimport.zipimporter('nltk.mod')
nltk = importer.load_module('nltk')
# now import whatever you like from nltk
from nltk import tree
from nltk import load_parser
from nltk.corpus import stopwords
nltk.data.path += ["."]
third: the command line to run the map-reduce job
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /your/path/to/mapper/mapper.py \
-mapper '/usr/local/bin/python3.4 mapper.py' \
-file /your/path/to/reducer/reducer.py \
-reducer '/usr/local/bin/python3.4 reducer.py' \
-file /your/path/to/nltkzippedmodfile/nltk.mod \
-input /your/path/to/HDFS/input/check.txt -output /your/path/to/HDFS/output/
Thus, the above steps solved my problem, and I think they should solve it for others as well.
Cheers,