
Hive transform using Python: Unable to initialize custom script

I'm trying to test Hive TRANSFORM by using a Python script as the mapper. My Hive script is:

add file  /full/path/to/mapper.py;

set mapred.job.queue.name=queue_name;

use my_database;

select transform(s.year, s.month, s.day, s.hour) 
using 'mapper.py' 
from my_table s limit 10; 

and my Python mapper script is simply trying to echo the input:

#!/usr/local/bin/python
import sys
for line in sys.stdin:
    print line

I have tried to run this with the following combinations:

  1. Removing the add file ... statement from the Hive script and giving the full path to mapper.py in the select ... statement

  2. Keeping add file ... and using the full path for the mapper: /path/to/mapper.py

  3. Keeping add file ... and using a relative path for the mapper: ./mapper.py

  4. Selecting the mapper output with an AS clause (using 'mapper.py' as line)

So far, all of the above attempts have resulted in Hive reporting that it cannot initialize my custom script:

FAILED: Execution Error, return code 20000 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Unable to initialize custom script.

I don't understand what this 'initialization' involves. Is Hive unable to

  1. find my script (i.e., a path issue)?
  2. locate the python executable (i.e., the #! shebang)?

I'm following the "Custom map/reduce scripts" section of the Hive tutorial.
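The two hypotheses above (a path problem versus a shebang problem) can be partly checked outside Hive. The following is a minimal sketch, not part of the original question: the helper name inspect_script is hypothetical, and it only inspects the machine it runs on, so the nodes that actually execute the Hive job may still differ.

```python
import os


def inspect_script(path):
    """Report local facts about a script that Hive would launch directly.

    Hypothetical helper for illustration; it checks only this machine.
    """
    report = {
        "exists": os.path.isfile(path),
        "executable": False,
        "shebang": None,
        "interpreter_exists": False,
    }
    if not report["exists"]:
        return report

    # Direct invocation (using 'mapper.py') needs the execute bit set.
    report["executable"] = os.access(path, os.X_OK)

    # Read the first line and look for a "#!" interpreter line.
    with open(path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", "replace").strip()
    if first_line.startswith("#!"):
        parts = first_line[2:].split()
        if parts:
            # The first token after "#!" is the interpreter path.
            report["shebang"] = parts[0]
            report["interpreter_exists"] = os.path.isfile(parts[0])
    return report
```

Calling inspect_script("/full/path/to/mapper.py") returns a dictionary. If "executable" or "interpreter_exists" is False, launching the script directly is likely to fail on that machine.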

RDK asked Feb 20 '15 20:02


1 Answer

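I resolved it by invoking the Python interpreter explicitly in the using clause of my select ... statement, so Hive no longer has to execute mapper.py directly: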

add file  /full/path/to/mapper.py;
select transform(s.year, s.month, s.day, s.hour) 
using 'python mapper.py'  -- this line changed
from my_table s limit 10; 
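Once the script starts, the echo logic itself can be tightened. By default Hive sends each row to the script as tab-separated fields on stdin and reads tab-separated lines back from stdout, and in Python 2 "print line" writes an extra blank line after every row because the input line already ends with a newline. Below is a minimal sketch of an echo mapper that avoids this. The function name echo_rows is hypothetical, and the demo at the bottom uses in-memory streams in place of the real input.

```python
import io


def echo_rows(source, sink):
    """Copy each tab-separated input row to the output unchanged."""
    for line in source:
        # Strip the trailing newline so each row is written exactly once.
        fields = line.rstrip("\n").split("\t")
        sink.write("\t".join(fields) + "\n")


# Demo with in-memory streams standing in for Hive's input and output.
sample_input = io.StringIO(u"2015\t02\t20\t20\n")
captured = io.StringIO()
echo_rows(sample_input, captured)
print(repr(captured.getvalue()))
```

In mapper.py itself, the demo lines would be replaced by import sys and a call to echo_rows(sys.stdin, sys.stdout).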

Reference post

RDK answered Nov 12 '22 20:11