Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hadoop Streaming: Mapper 'wrapping' a binary executable

I have a pipeline that I currently run on a large university computer cluster. For publication purposes I'd like to convert it into mapreduce format such that it could be run by anyone on using a hadoop cluster such as amazon webservices (AWS). The pipeline currently consists of as series of python scripts that wrap different binary executables and manage the input and output using the python subprocess and tempfile modules. Unfortunately I didn’t write the binary executables and many of them either don’t take STDIN or don't emit STDOUT in a ‘useable’ fashion (e.g., only sent it to files). These problems are why I’ve wrapped most of them in python.

So far I’ve been able to modify my Python code such that I have a mapper and a reducer that I can run on my local machine in the standard ‘test format.’

$ cat data.txt | mapper.py | reducer.py

The mapper formats each line of data the way the binary it wraps wants it, sends the text to the binary using subprocess.popen (this also allows me to mask a lot of spurious STDOUT), then collects the STOUT I want, and formats it into lines of text appropriate for the reducer. The problems arise when I try to replicate the command on a local hadoop install. I can get the mapper to execute, but it give an error that suggests that it can’t find the binary executable.

File "/Users/me/Desktop/hadoop-0.21.0/./phyml.py", line 69, in main() File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 66, in main phyml(None) File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 46, in phyml ft = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE) File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 621, in init errread, errwrite) File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 1126, in _execute_child raise child_exception OSError: [Errno 13] Permission denied

My hadoop command looks like the following:

./bin/hadoop jar /Users/me/Desktop/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-input /Users/me/Desktop/Code/AWS/temp/data.txt \
-output /Users/me/Desktop/aws_test \
-mapper  mapper.py \
-reducer  reducer.py \
-file /Users/me/Desktop/Code/AWS/temp/mapper.py \
-file /Users/me/Desktop/Code/AWS/temp/reducer.py \
-file /Users/me/Desktop/Code/AWS/temp/binary

As I noted above it looks to me like the mapper isn't aware of the binary - perhaps it's not being sent to the compute node? Unfortunately I can't really tell what the problem is. Any help would be greatly appreciated. It would be particulary nice to see some hadoop streaming mappers/reducers written in python that wrap binary executables. I can’t imagine I’m the first one to try to do this! In fact, here is another post asking essentially the same question, but it hasn't been answered yet...

Hadoop/Elastic Map Reduce with binary executable?

like image 824
Nick Crawford Avatar asked Nov 06 '10 15:11

Nick Crawford


2 Answers

After much googling (etc.) I figured out how to include executable binaries/scripts/modules that are accessible to your mappers/reducers. The trick is to upload all you files to hadoop first.

$ bin/hadoop dfs -copyFromLocal /local/file/system/module.py module.py

Then you need to format you streaming command like the following template:

$ ./bin/hadoop jar /local/file/system/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-file /local/file/system/data/data.txt \
-file /local/file/system/mapper.py \
-file /local/file/system/reducer.py \
-cacheFile hdfs://localhost:9000/user/you/module.py#module.py \
-input data.txt \
-output output/ \
-mapper mapper.py \
-reducer reducer.py \
-verbose

If you're linking a python module you'll need to add the following code to your mapper/reducer scripts:

import sys 
sys.path.append('.')
import module

If you're accessing a binary via subprocessing your command should look something like this:

cli = "./binary %s" % (argument)
cli_parts = shlex.split(cli)
mp = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
mp.communicate()[0]

Hope this helps.

like image 97
Nick Crawford Avatar answered Nov 15 '22 03:11

Nick Crawford


Got it running finally

$pid = open2 (my $out, my $in, "./binary") or die "could not run open2";
like image 43
Libin Avatar answered Nov 15 '22 03:11

Libin