Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Streaming or custom Jar in Hadoop

I'm running a streaming job in Hadoop (on Amazon's EMR) with the mapper and reducer written in Python. I want to know about the speed gains I would experience if I implement the same mapper and reducer in Java (or use Pig).

In particular, I'm looking for people's experiences on migrating from streaming to custom jar deployments and/or Pig and also documents containing benchmark comparisons of these options. I found this question, but the answers are not specific enough for me. I'm not looking for comparisons between Java and Python, but comparisons between custom jar deployment in Hadoop and Python-based streaming.

My job is reading NGram counts from the Google Books NGgram dataset and computing aggregate measures. It seems like CPU utilization on the compute nodes are close to 100%. (I would like to hear your opinions about the differences of having CPU-bound or an IO-bound job, as well).

Thanks!

Amaç

like image 624
Ruggiero Spearman Avatar asked Jul 29 '11 12:07

Ruggiero Spearman


1 Answers

Why consider deploying custom jars ?

  • Ability to use more powerful custom Input formats. For streaming jobs, even if you use pluggable input/output like it's mentioned here, you are limited to the key and value(s) to your mapper/reducer being a text/string. You would need to expend some amount of CPU cycles to convert to your required type.
  • Ive also heard that Hadoop can be smart about reusing JVMs across multiple Jobs which wont be possible when streaming (can't confirm this)

When to use pig ?

  • Pig Latin is pretty cool and is a much higher level data flow language than java/python or perl. Your Pig scripts WILL tend to be much smaller than an equivalent task written any of the other languages

When to NOT use pig ?

  • Even though pig is pretty good at figuring out by itself how many maps/reduce and when to spawn a map or reduce and a myriad of such things, if you are dead sure how many maps/reduce you need and you have some very specific computation you need to do within your Map/reduce functions and you are very specific about performance, then you should consider deploying your own jars. This link shows that pig can lag native hadoop M/R in performance. You could also take a look at writing your own Pig UDFs which isolate some compute intensive function (and possibly even use JNI to call some native C/C++ code within the UDF)

A Note on IO and CPU bound jobs :

  • Technically speaking, the whole point of hadoop and map reduce is to parallelize compute intensive functions, so i'd presume your map and reduce jobs are compute intensive. The only time the Hadoop subsystem is busy doing IO is in between the map and reduce phase when data is sent across the network. Also if you have large amount of data and you have manually configured too few maps and reduces resulting in spills to disk (although too many tasks will results in too much time spent starting / stopping JVMs and too many small files). A streaming Job would also have the additional overhead of starting a Python/Perl VM and have data being copied to and fro between the JVM and the scripting VM.
like image 61
arun_suresh Avatar answered Oct 10 '22 12:10

arun_suresh