I am trying to include a Python package (NLTK) with a Hadoop streaming job, but I'm not sure how to do this without including every file manually via the CLI argument "-file".
Edit: One solution would be to install this package on all the slaves, but I don't have that option currently.
Compatibility with Hadoop: the Hadoop framework itself is written in Java, but Hadoop programs can be written in other languages such as Python or C++.
Features of Hadoop Streaming: Hadoop Streaming supports almost any programming language, such as Python, C++, Ruby, or Perl. The streaming framework itself still runs on Java; only the mapper and reducer code is written in the other language.
To run Python on Hadoop, we use the Hadoop Streaming library to pipe data between the Java framework and the Python executables, so a mapper.py and a reducer.py simply read their input from STDIN and write their output to STDOUT.
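For context, here is a minimal word-count mapper and reducer sketch (my own illustration, not part of the original question) showing the STDIN/STDOUT contract that Hadoop Streaming expects:
# mapper.py - reads raw lines from STDIN, emits "word<TAB>1" pairs on STDOUT
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py - input arrives sorted by key, so sum consecutive counts per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))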
Just came across this gem of a solution: http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
first, create a zip with the libraries you need:
zip -r nltkandyaml.zip nltk yaml
mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod
next, include it via the Hadoop Streaming "-file" argument:
hadoop -file nltkandyaml.mod
finally, load the libraries via Python:
import zipimport
importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')
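Once loaded this way, the modules behave like ordinary imports, and further imports from inside the packages are resolved from the archive (the later answer below relies on the same behaviour). A small usage sketch, my own illustration; the RegexpTokenizer is just an example of a call that needs no separate NLTK data files:
import sys
import zipimport

importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')

# submodules can now be imported from the archive as usual
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')         # regex tokenizer, needs no corpus data
config = yaml.safe_load("min_length: 3")    # tiny YAML snippet, just to show yaml works

for line in sys.stdin:
    for token in tokenizer.tokenize(line):
        if len(token) >= config["min_length"]:
            print("%s\t1" % token.lower())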
Additionally, this page summarizes how to include a corpus: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/
download and unzip the wordnet corpus
cd wordnet
zip -r ../wordnet-flat.zip *
in python (the import location and constructor arguments may vary slightly by NLTK version):
from nltk.corpus.reader import WordNetCorpusReader
wn = WordNetCorpusReader(nltk.data.find('lib/wordnet-flat.zip'))
I would zip up the package into a .tar.gz or a .zip and pass the entire tarball or archive in a -file option to your hadoop command. I've done this in the past with Perl but not Python.
That said, I would think this would still work for you if you use Python's zipimport (http://docs.python.org/library/zipimport.html), which allows you to import modules directly from a zip.
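One caveat worth noting: zipimport only reads zip archives, so if you go the .tar.gz route you have to unpack the tarball yourself at the start of the mapper. A rough sketch, assuming a hypothetical archive name nltk.tar.gz shipped via -file:
import sys
import tarfile

# zipimport cannot read tarballs, so unpack the shipped archive into the task's
# working directory and put that directory on the import path instead
tarfile.open('nltk.tar.gz', 'r:gz').extractall('.')
sys.path.insert(0, '.')

import nltk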
You can also put the zip file directly on sys.path, like this:
import sys
sys.path.insert(0, 'nltkandyaml.mod')
import nltk
import yaml
For an example of loading an external Python package such as nltk, refer to this answer:
Running external python lib like (NLTK) with hadoop streaming
I followed the approach below and ran the nltk package with Hadoop streaming successfully.
Assumption: you already have your package (nltk, in my case) on your system.
first:
zip -r nltk.zip nltk
mv nltk.zip /place/it/anywhere/you/like/nltk.mod
Why will any location work?
Answer: because we will provide the path to this .mod zipped file on the command line, so we don't need to worry much about where it lives.
second:
make these changes in your mapper .py file
# Hadoop does not unzip shipped files by default, so load the package straight from the zipped .mod file
import zipimport
importer = zipimport.zipimporter('nltk.mod')
nltk = importer.load_module('nltk')
# now import whatever you like from nltk
from nltk import tree
from nltk import load_parser
from nltk.corpus import stopwords
nltk.data.path += ["."]
third: the command line to run the map-reduce job
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /your/path/to/mapper/mapper.py \
-mapper '/usr/local/bin/python3.4 mapper.py' \
-file /your/path/to/reducer/reducer.py \
-reducer '/usr/local/bin/python3.4 reducer.py' \
-file /your/path/to/nltkzippedmodfile/nltk.mod \
-input /your/path/to/HDFS/input/check.txt -output /your/path/to/HDFS/output/
Thus, the above steps solved my problem, and I think they should solve it for others as well.
Cheers,