 

Sample map-reduce script in Python for Hive produces an exception

I am learning Hive. I have set up a table named records, with the following schema:

year        : string
temperature : int
quality     : int

Here are some sample rows:

1999 28 3
2000 28 3
2001 30 2

Now I wrote a sample map-reduce script in Python, exactly as specified in the book Hadoop: The Definitive Guide:

import re
import sys

for line in sys.stdin:
    (year,tmp,q) = line.strip().split()
    if (tmp != '9999' and re.match("[01459]",q)):
        print "%s\t%s" % (year,tmp)

I run this using the following commands:

ADD FILE /usr/local/hadoop/programs/sample_mapreduce.py;
SELECT TRANSFORM(year, temperature, quality)
USING 'sample_mapreduce.py'
AS year,temperature;

Execution fails. On the terminal I get this:

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2012-08-23 18:30:28,506 Stage-1 map = 0%,  reduce = 0%
2012-08-23 18:30:59,647 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201208231754_0005 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201208231754_0005_m_000002 (and more) from job job_201208231754_0005
Exception in thread "Thread-103" java.lang.RuntimeException: Error while reading from task log url
    at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
    at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
    at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://master:50060/tasklog?taskid=attempt_201208231754_0005_m_000000_2&start=-8193
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
    at java.net.URL.openStream(URL.java:1010)
    at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
    ... 3 more

I go to the failed job list, and this is the stack trace:

java.lang.RuntimeException: Hive Runtime Error while closing operators
    at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:226)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hit error while closing ..
    at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:452)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:193)
    ... 8 more

The same trace is repeated three more times.

Can someone please help me with this? What is wrong here? I am following the book exactly. There seem to be two errors: on the terminal it says that it can't read from the task log URL, while in the failed job list the exception says something different. Please help.

asked Aug 23 '12 by Shades88



1 Answer

I went to the stderr log from the Hadoop admin interface and saw that there was a syntax error from Python. Then I found that when I created the Hive table the field delimiter was a tab, but I hadn't specified it in split(). So I changed it to split('\t') and it worked alright!
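
For reference, here is a minimal sketch of the corrected mapper, assuming the only change needed is the explicit tab delimiter described above; everything else mirrors the script from the question:

#!/usr/bin/env python
# Hypothetical corrected mapper: split each input line explicitly on tab,
# since the Hive table was created with a tab field delimiter.
import re
import sys

for line in sys.stdin:
    (year, tmp, q) = line.strip().split('\t')
    if tmp != '9999' and re.match('[01459]', q):
        # Emit year and temperature as a tab-separated pair.
        print('%s\t%s' % (year, tmp))

An explicit split('\t') also preserves empty fields, which a bare whitespace split() would silently drop.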

answered Oct 10 '22 by Shades88