How do I pass a parameter to a Python Hadoop streaming job?

Question

For a Python Hadoop streaming job, how do I pass a parameter to, for example, the reducer script so that it behaves differently based on the parameter passed in?

I understand that streaming jobs are called in the format of:

hadoop jar hadoop-streaming.jar -input -output -mapper mapper.py -reducer reducer.py ...

I want to affect reducer.py.

Mohamed El-Touny · Accepted Answer

You can pass arguments to the script inside the quoted -reducer (or -mapper) command, as in the command below:

hadoop jar hadoop-streaming.jar \
-mapper 'count_mapper.py arg1 arg2' -file count_mapper.py \
-reducer 'count_reducer.py arg3' -file count_reducer.py \

You can review this link for details.

Ray Toal · Answer

The argument to the command line option -reducer can be any command, so you can try:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input inputDirs \
    -output outputDir \
    -mapper myMapper.py \
    -reducer 'myReducer.py 1 2 3' \
    -file myMapper.py \
    -file myReducer.py

assuming myReducer.py is made executable. Disclaimer: I have not tried it, but I have passed similar complex strings to -mapper and -reducer before.
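On the Python side, arguments passed this way should show up in sys.argv like any other command-line arguments. Here is a minimal sketch of what myReducer.py could look like. It assumes the input is tab-separated key/count lines and that the first argument is a minimum total to emit; the names parse_threshold and main, and the meaning of the argument, are made up for illustration and are not from the question.

```python
#!/usr/bin/env python
import sys


def parse_threshold(argv, default=0):
    """Return the first argument after the script name as an int (hypothetical meaning)."""
    return int(argv[1]) if len(argv) > 1 else default


def main(argv, stdin, stdout):
    threshold = parse_threshold(argv)
    totals = {}
    for line in stdin:
        # Assumed record layout: key<TAB>count
        key, sep, value = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # skip lines that have no tab-separated value
        totals[key] = totals.get(key, 0) + int(value)
    # Emit only the keys whose total reaches the threshold given on the command line
    for key in sorted(totals):
        if totals[key] >= threshold:
            stdout.write("%s\t%d\n" % (key, totals[key]))


if __name__ == "__main__":
    main(sys.argv, sys.stdin, sys.stdout)
```

With -reducer 'myReducer.py 1 2 3', this sketch would only look at the first value and ignore the other two.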

That said, have you tried the

-cmdenv name=value

option, and just have your Python reducer get its value from the environment? It's just another way to do things.

luismartingil · Answer

In your Python code,

import os
(...)
os.environ["PARAM_OPT"]

In your Hadoop command include:

hadoop jar \
(...)
-cmdenv PARAM_OPT=value \
(...)
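To make the environment-variable approach concrete, here is a rough sketch of a reducer that switches behavior on PARAM_OPT. The values "sum" and "max", the default, and the helper names are assumptions for the example only. It uses environ.get with a fallback so the script still runs when the variable is not set, for instance when testing locally without Hadoop.

```python
#!/usr/bin/env python
import os
import sys


def read_mode(environ, name="PARAM_OPT", default="sum"):
    """Look up the job parameter, falling back to a default when it is unset."""
    return environ.get(name, default)


def combine(values, mode):
    """Combine one key's values according to the (hypothetical) mode."""
    if mode == "sum":
        return sum(values)
    if mode == "max":
        return max(values)
    raise ValueError("unsupported PARAM_OPT value: %r" % mode)


def main(environ, stdin, stdout):
    mode = read_mode(environ)
    grouped = {}
    for line in stdin:
        # Assumed record layout: key<TAB>integer value
        key, sep, value = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # skip lines that have no tab-separated value
        grouped.setdefault(key, []).append(int(value))
    for key in sorted(grouped):
        stdout.write("%s\t%d\n" % (key, combine(grouped[key], mode)))


if __name__ == "__main__":
    main(os.environ, sys.stdin, sys.stdout)
```

You can try it locally before submitting the job, e.g. by piping a small file into the script with PARAM_OPT set in your shell.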
