
Hadoop Streaming job failed in Python

I have a MapReduce job written in Python. The program was tested successfully in a Linux environment, but it fails when I run it under Hadoop.

Here is the job command:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1+169.127-streaming.jar \
   -input /data/omni/20110115/exp6-10122 -output /home/yan/visitorpy.out \
   -mapper SessionMap.py -reducer SessionRed.py \
   -file SessionMap.py -file SessionRed.py

The mode of Session*.py is 755, and #!/usr/bin/env python is the first line of each *.py file. The mapper, SessionMap.py, is:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    val = line.split("\t")
    # fields 4 and 5 form the visitor id, field 108 is the session id
    (visidH, visidL, sessionID) = (val[4], val[5], val[108])
    print "%s%s\t%s" % (visidH, visidL, sessionID)

Error from the log:

java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:260)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:126)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
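A Broken pipe at TextInputWriter usually just means the Python mapper process exited before Hadoop finished feeding it input, for example because of an unhandled exception on a malformed record. Purely as an illustration (not the code that was actually run), a more defensive variant of the mapper that skips lines with fewer than 109 tab-separated fields instead of crashing would be:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    val = line.rstrip("\n").split("\t")
    if len(val) < 109:
        # skip malformed records instead of raising IndexError and dying
        continue
    (visidH, visidL, sessionID) = (val[4], val[5], val[108])
    print "%s%s\t%s" % (visidH, visidL, sessionID)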
asked May 04 '11 at 21:05 by Yuhang

2 Answers

I had the same problem and was puzzled: my mapper and reducer ran fine on test data locally, but when I ran the same test set through Hadoop MapReduce I got the same error.

How to test your code locally:

cat <test file> | python mapper.py | sort | python reducer.py
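The thread never shows SessionRed.py; purely as a hypothetical sketch, a minimal reducer that matches the mapper's key/value output from the question (visitor id, tab, session id) and counts distinct sessions per visitor could look like this:

#!/usr/bin/env python
import sys

current_key = None
sessions = set()

for line in sys.stdin:
    # the mapper emits: <visidH+visidL> TAB <sessionID>
    key, session_id = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print "%s\t%d" % (current_key, len(sessions))
        current_key = key
        sessions = set()
    sessions.add(session_id)

if current_key is not None:
    print "%s\t%d" % (current_key, len(sessions))

With some reducer in place, the cat | mapper | sort | reducer pipeline above exercises roughly the same data flow that Hadoop Streaming drives.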

On further investigation, I found that I hadn't included the shebang line in my mapper.py script.

#!/usr/bin/python

Add that line as the first line of your Python script and leave a blank line after it.
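For example, the top of mapper.py would then start like this (the body below is just an identity mapper for illustration):

#!/usr/bin/python

import sys

# the interpreter line above must be the very first line of the file
for line in sys.stdin:
    sys.stdout.write(line)  # identity mapper, for illustration only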

If you want to know more about the shebang line, read Why do people write #!/usr/bin/env python on the first line of a Python script?

answered Oct 10 '22 at 11:10 by Akash Agrawal


You can find the Python error messages (for example, the traceback) and anything else your script writes to stderr in the Hadoop web interface. It is a little hidden, but you will find it via the link the streaming job gives you: click 'Map' or 'Reduce', then click any task, and in the Task Logs column click 'All'.
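For a traceback to show up there, it has to reach stderr. As a sketch (not part of the original answer), the question's mapper could be wrapped so that any exception is written to stderr and therefore appears under the 'All' task log:

#!/usr/bin/env python
import sys
import traceback

try:
    for line in sys.stdin:
        val = line.split("\t")
        (visidH, visidL, sessionID) = (val[4], val[5], val[108])
        print "%s%s\t%s" % (visidH, visidL, sessionID)
except Exception:
    # anything written to stderr ends up in the task's "All" log in the web UI
    traceback.print_exc(file=sys.stderr)
    sys.exit(1)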

answered Oct 10 '22 at 10:10 by Michel Samia