
Custom Map Reduce Program on Hive, what's the Rule? How about input and output?

I have been stuck for a few days because I want to create a custom map/reduce program based on my Hive query. I found few examples after googling, and I'm still confused about the rules.

What are the rules for creating a custom MapReduce program for Hive? What about the mapper and reducer classes?

Can anyone provide a solution?

I want to develop this program in Java. Also, when formatting output in the collector, how do I format the result in the mapper and reducer classes?

Can anybody give me an example and an explanation of this kind of thing?

fahmi asked May 30 '11



1 Answer

There are basically two ways to add custom mappers/reducers to Hive queries.

  1. Using TRANSFORM

SELECT TRANSFORM(stuff1, stuff2) FROM table1 USING 'script' AS thing1, thing2

where stuff1 and stuff2 are fields in table1, and 'script' is any executable that accepts the format I describe later. thing1 and thing2 are the outputs from the script.
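Since the question asks for Java, here is a minimal sketch of what such an executable could look like. The class name and the transformation itself (upper-casing the first field, taking the length of the second) are invented for illustration; Hive only requires that the executable read tab-separated lines from stdin and write tab-separated lines to stdout:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Hypothetical TRANSFORM script: reads tab-separated (stuff1, stuff2) rows
// from stdin and writes tab-separated (thing1, thing2) rows to stdout.
public class Transform {

    // Turn one input row into one output row:
    // thing1 = upper-cased stuff1, thing2 = length of stuff2.
    static String transformLine(String line) {
        String[] fields = line.split("\t", -1);
        String stuff1 = fields[0];
        String stuff2 = fields.length > 1 ? fields[1] : "";
        return stuff1.toUpperCase() + "\t" + stuff2.length();
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(transformLine(line));
        }
    }
}
```

You would ship the compiled class (or a wrapper shell script that runs it) to the cluster with ADD FILE before referencing it in USING 'script'.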

  2. Using MAP and REDUCE
FROM (
    FROM table
    MAP table.f1 table.f2
    USING 'map_script'
    AS mp1, mp2
    CLUSTER BY mp1) map_output
  INSERT OVERWRITE TABLE someothertable
    REDUCE map_output.mp1, map_output.mp2
    USING 'reduce_script'
    AS reducef1, reducef2;

This is slightly more complicated but gives you more control. There are two parts. In the first part, the mapper script receives data from table and maps it to the fields mp1 and mp2. These are then passed on to reduce_script, which receives its input sorted on the key we specified in CLUSTER BY mp1. Mind you, more than one key may be handled by a single reducer. The output of the reduce script goes to the table someothertable.

Now, all these scripts follow a simple pattern: they read line by line from stdin, with fields separated by '\t', and write back to stdout in the same manner (fields separated by '\t').
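To make the reduce side concrete, here is a sketch of a reduce script in Java. The aggregation (summing the second column per key) is an invented example; the only part the framework dictates is that stdin arrives sorted by the CLUSTER BY key, so a group ends whenever the key changes. Collecting all lines into a list first is a simplification for clarity; a production script would emit each group as soon as the key changes while streaming:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reduce script: input lines are "key\tvalue", sorted by key.
// Emits one "key\tsum" line per distinct key.
public class Reduce {

    static List<String> reduceLines(List<String> lines) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        long sum = 0;
        for (String line : lines) {
            String[] f = line.split("\t", -1);
            // Input is sorted by key, so a new key means the group is done.
            if (currentKey != null && !currentKey.equals(f[0])) {
                out.add(currentKey + "\t" + sum);
                sum = 0;
            }
            currentKey = f[0];
            sum += Long.parseLong(f[1]);
        }
        if (currentKey != null) {
            out.add(currentKey + "\t" + sum);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        for (String s : reduceLines(lines)) {
            System.out.println(s);
        }
    }
}
```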

Check out this blog; it has some nice examples:

http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html

http://dev.bizo.com/2009/10/reduce-scripts-in-hive.html

Rohan Monga answered Oct 05 '22