How do I pass a parameter to a python Hadoop streaming job?

For a Python Hadoop streaming job, how do I pass a parameter to, for example, the reducer script so that it behaves differently based on the parameter passed in?

I understand that streaming jobs are invoked in the format:

hadoop jar hadoop-streaming.jar -input <input> -output <output> -mapper mapper.py -reducer reducer.py ...

In this case, I want to affect the behavior of reducer.py.

asked Mar 01 '12 by zzztimbo


3 Answers

You can pass arguments to -mapper and -reducer as in the command below:

hadoop jar hadoop-streaming.jar \
    -mapper 'count_mapper.py arg1 arg2' -file count_mapper.py \
    -reducer 'count_reducer.py arg3' -file count_reducer.py \
    ...
You can also review this link for more details.
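
Inside count_reducer.py, the extra tokens arrive as ordinary command-line arguments via sys.argv. A minimal sketch of such a reducer, assuming arg3 is a numeric count threshold (the word-count logic here is illustrative, not from the answer):

#!/usr/bin/env python
import sys

# arg3 from the command line; treating it as a count threshold is an assumption
threshold = int(sys.argv[1])

current_key, current_count = None, 0
for line in sys.stdin:
    # streaming reducers receive tab-separated key/value lines, sorted by key
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None and current_count >= threshold:
            print("%s\t%d" % (current_key, current_count))
        current_key, current_count = key, 0
    current_count += int(value)

# flush the last key
if current_key is not None and current_count >= threshold:
    print("%s\t%d" % (current_key, current_count))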

answered Nov 15 '22 by Mohamed El-Touny


The argument to the command line option -reducer can be any command, so you can try:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input inputDirs \
    -output outputDir \
    -mapper myMapper.py \
    -reducer 'myReducer.py 1 2 3' \
    -file myMapper.py \
    -file myReducer.py

assuming myReducer.py has been made executable. Disclaimer: I have not tried this exact command, but I have passed similarly complex strings to -mapper and -reducer before.
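
For the script to be runnable this way, it needs a shebang line and execute permission; a quick sketch (the chmod step and the argument unpacking are standard practice, assumed here rather than spelled out in the answer):

#!/usr/bin/env python
# myReducer.py -- make executable with: chmod +x myReducer.py
import sys

# invoked as 'myReducer.py 1 2 3', so the extra strings land in sys.argv
args = sys.argv[1:]   # ['1', '2', '3']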

That said, have you tried the

-cmdenv name=value

option, letting your Python reducer read the value from the environment? It's just another way to do things.

answered Nov 15 '22 by Ray Toal


In your Python code,

import os
(...)
# read the value passed in via -cmdenv
param_opt = os.environ["PARAM_OPT"]

In your Hadoop command, include:

hadoop jar \
(...)
-cmdenv PARAM_OPT=value \
(...)
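
Putting it together, the reducer can branch on the environment variable; a minimal sketch, assuming a hypothetical PARAM_OPT value of "upper" that upper-cases keys (the branch logic is purely illustrative):

#!/usr/bin/env python
import os
import sys

# set in each task's environment by -cmdenv PARAM_OPT=value
param_opt = os.environ["PARAM_OPT"]

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    # behave differently based on the passed-in parameter
    if param_opt == "upper":
        key = key.upper()
    print("%s\t%s" % (key, value))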
answered Nov 15 '22 by luismartingil