I have a MapReduce job written in Python. The program was tested successfully in a Linux environment, but it fails when I run it under Hadoop.
Here is the job command:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1+169.127-streaming.jar \
-input /data/omni/20110115/exp6-10122 -output /home/yan/visitorpy.out \
-mapper SessionMap.py -reducer SessionRed.py -file SessionMap.py \
-file SessionRed.py
The mode of the Session*.py files is 755, and #!/usr/bin/env python
is the first line of each *.py file. The mapper (SessionMap.py) is:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    val = line.split("\t")
    (visidH, visidL, sessionID) = (val[4], val[5], val[108])
    print "%s%s\t%s" % (visidH, visidL, sessionID)
Error from the log:
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:126)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
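A "Broken pipe" in streaming usually means the Python process exited before Hadoop finished feeding it input, for example because one malformed record raised an IndexError (the mapper indexes val[108], so any line with fewer than 109 tab-separated fields kills the task). Below is a defensive sketch of the mapper; the field indices come from the question, while the length check and the map_line helper are my additions:

```python
#!/usr/bin/env python
# Defensive variant of the mapper above (a sketch, not the original
# SessionMap.py): skip lines with too few tab-separated fields so one
# bad record cannot crash the task, which Hadoop reports as a broken pipe.
import sys

def map_line(line):
    """Return the 'key<TAB>value' output for one input line, or None to skip."""
    val = line.rstrip("\n").split("\t")
    if len(val) < 109:          # field 108 is the highest index we read
        return None             # malformed record: skip instead of crashing
    visidH, visidL, sessionID = val[4], val[5], val[108]
    return "%s%s\t%s" % (visidH, visidL, sessionID)

if __name__ == "__main__":
    for line in sys.stdin:
        out = map_line(line)
        if out is not None:
            print(out)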
I ran into the same problem: my mapper and reducer worked fine on local test data, but failed with this same error when I ran them via Hadoop MapReduce.
How to test your code locally:
cat <test file> | python mapper.py | sort | python reducer.py
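To make that local pipeline runnable end to end, here is a hypothetical reducer sketch (SessionRed.py itself is not shown in the question): a typical streaming reducer that counts how many records share each key, relying on the fact that streaming delivers reducer input sorted by key:

```python
#!/usr/bin/env python
# Hypothetical reducer sketch (the real SessionRed.py is not shown):
# count records per key. Streaming sorts mapper output by key, so a
# simple running counter over consecutive equal keys is sufficient.
import sys

def reduce_lines(lines):
    """Yield 'key<TAB>count' for a key-sorted iterable of 'key<TAB>value' lines."""
    current_key, count = None, 0
    for line in lines:
        key = line.rstrip("\n").split("\t", 1)[0]
        if key == current_key:
            count += 1
        else:
            if current_key is not None:
                yield "%s\t%d" % (current_key, count)
            current_key, count = key, 1
    if current_key is not None:   # flush the final key
        yield "%s\t%d" % (current_key, count)

if __name__ == "__main__":
    for out in reduce_lines(sys.stdin):
        print(out)
```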
On further investigation, I found that I had not included the shebang line in my mapper.py script:
#!/usr/bin/python
Add this line as the first line of your Python script, and leave a blank line after it.
If you want to know more about the shebang line, see the question "Why do people write #!/usr/bin/env python on the first line of a Python script?"
You can find the Python error messages (e.g. the traceback) and anything else your script writes to stderr in the Hadoop web interface. It is a little hidden, but you can reach it through the link streaming gives you: click 'Map' or 'Reduce', then click any task, and then in the 'Task Logs' column click 'All'.