
Using Pig/Hive for data processing instead of direct Java map-reduce code?

(Even more basic than Difference between Pig and Hive? Why have both?)

I have a data processing pipeline written in several Java map-reduce tasks over Hadoop (my own custom code, derived from Hadoop's Mapper and Reducer). It's a series of basic operations such as join, inverse, sort and group by. My code is involved and not very generic.

What are the pros and cons of continuing this admittedly development-intensive approach vs. migrating everything to Pig/Hive with several UDFs? Which jobs won't I be able to express? Will I suffer a performance degradation (I'm working with hundreds of TB)? Will I lose the ability to tweak and debug my code during maintenance? Will I be able to keep part of the pipeline as Java map-reduce jobs and use their input/output with my Pig/Hive jobs?
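To make the question concrete, here is roughly what I imagine the migrated pipeline would look like in Pig (file names, schemas, and the jar are invented for illustration; I understand Pig's MAPREDUCE operator is intended for exactly this kind of native-job interop, but I haven't verified the details):

```pig
-- join + group by, the kind of steps my Java jobs do today
raw     = LOAD 'events' AS (user:chararray, item:chararray, ts:long);
items   = LOAD 'items'  AS (item:chararray, category:chararray);
joined  = JOIN raw BY item, items BY item;
grouped = GROUP joined BY items::category;
counts  = FOREACH grouped GENERATE group AS category, COUNT(joined) AS n;

-- keep one performance-critical legacy step as native Java map/reduce;
-- Pig's MAPREDUCE operator stores its input, runs the jar, reloads the output
scored = MAPREDUCE 'my-legacy-job.jar'
         STORE counts INTO 'mr_in'
         LOAD 'mr_out' AS (category:chararray, score:double)
         `com.example.LegacyJob mr_in mr_out`;
STORE scored INTO 'results';
```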

asked Nov 07 '11 by ihadanny

2 Answers

Reference (Twitter): Typically a Pig script is 5% of the code of native map/reduce, written in about 5% of the time. However, queries typically take between 110% and 150% of the time a native map/reduce job would have taken to execute. Of course, if a routine is highly performance-sensitive, they still have the option to hand-code the native map/reduce functions directly.

The above reference also talks about pros and cons of Pig over developing applications in MapReduce.

As with any higher-level language or abstraction, Pig/Hive trade away some flexibility and performance in exchange for developer productivity.
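As a rough illustration of the code-size claim: a grouped count that takes a full Mapper, Reducer, and driver class in native Java map/reduce is a handful of lines in Pig (file and field names here are invented):

```pig
logs   = LOAD 'access_log' AS (url:chararray);
by_url = GROUP logs BY url;
hits   = FOREACH by_url GENERATE group AS url, COUNT(logs) AS n;
STORE hits INTO 'url_counts';
```

The flip side is the flexibility cost: in native code you can hand-tune the combiner, partitioner, and record serialization, while in Pig you mostly rely on what the planner chooses for you.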

answered Nov 03 '22 by Praveen Sripati

In this paper, from 2009, it is stated that Pig runs 1.5 times slower than plain MapReduce. Higher-level tools built on top of Hadoop are expected to perform slower than plain MapReduce; on the other hand, getting optimal performance out of MapReduce requires an advanced user who writes a lot of boilerplate code (e.g. binary comparators).
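To illustrate the kind of boilerplate meant by "binary comparators": Hadoop lets you register a raw comparator that orders keys by inspecting their serialized bytes, skipping per-record deserialization during the sort. The sketch below shows the core idea in plain Java, with no Hadoop dependency; the class and method names are made up for the example, not Hadoop's actual API:

```java
import java.nio.ByteBuffer;

// Sketch of a "raw" binary comparator in the style of Hadoop's
// WritableComparator: it orders serialized ints byte-by-byte,
// without deserializing them first.
public class RawIntComparator {

    // Serialize an int as 4 big-endian bytes (IntWritable's layout).
    public static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Compare two serialized ints directly on their bytes.
    // Big-endian two's-complement ints compare like unsigned byte
    // strings once the sign bit of the first byte is flipped.
    public static int compare(byte[] a, byte[] b) {
        int first  = (a[0] ^ 0x80) & 0xFF;  // flip sign bit -> unsigned order
        int second = (b[0] ^ 0x80) & 0xFF;
        if (first != second) return Integer.compare(first, second);
        for (int i = 1; i < 4; i++) {
            int ai = a[i] & 0xFF, bi = b[i] & 0xFF;
            if (ai != bi) return Integer.compare(ai, bi);
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(compare(serialize(-5), serialize(3)) < 0);   // negative sorts first
        System.out.println(compare(serialize(7), serialize(7)) == 0);
        System.out.println(compare(serialize(100), serialize(42)) > 0);
    }
}
```

Writing and registering such comparators for every composite key type is exactly the repetitive work that Pig, Hive, and Pangool take off the developer's hands.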

I find it relevant to mention a new API called Pangool (which I'm a developer of) that aims to replace the plain Hadoop MapReduce API, making many things easier to code and understand (secondary sort, reduce-side joins). Pangool imposes barely any performance overhead (about 5% in its first benchmark) and retains all the flexibility of the original MapRed API.

answered Nov 02 '22 by Pere