What Are the Pros and Cons of Running a Job in Hadoop Using Various Languages?

So far I've used either Pig or Java MapReduce exclusively for running jobs against a Hadoop cluster. I recently tried Python MapReduce through Hadoop streaming, and that was pretty cool as well. All of these make sense to me, but I'm a little hazy on when I would want to use one implementation vs. another. I've been using Java MapReduce basically whenever I need speed, but when would I ever want to use something like Python streaming instead of just writing the same thing in fewer, more easily understandable lines in Pig/Hive? In short, what are the pros and cons of each?

361

asked Mar 05 '12 15:03

Eli

1 Answer

I will address Java vs. Python first and then MapReduce vs. Hive/Pig, since I see them as two different issues.
Hadoop is built around Java: many of its capabilities are available through the Java API, and Hadoop is mostly extended by writing Java classes.

Hadoop does have the capability to run MR jobs written in other languages; it is called streaming. This model only lets us define the mapper and reducer, with some restrictions not present in Java. At the same time, input/output formats and other plugins still have to be written as Java classes.
So I would frame the decision as follows: a) Use Java, unless you have a serious existing codebase that you need to reuse in your MR job. b) Consider Python when you need to create some simple ad hoc jobs.
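To make the streaming model concrete, here is a minimal sketch of a word-count mapper and reducer in Python. It is an illustration only, not code from the question: the function names and the single-script "map"/"reduce" argument are assumptions made for this example. It relies on the default streaming convention of tab-separated key/value lines on stdin/stdout, with the reducer receiving its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key; assumes the input is sorted by key."""
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if line.strip()  # skip blank lines defensively
    )
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical entry point: run as "script.py map" or "script.py reduce".
if __name__ == "__main__" and len(sys.argv) > 1:
    role = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in role(sys.stdin):
        print(record)
```

Saved under a hypothetical name such as wc.py, the script can be checked locally with a shell pipeline like `cat input.txt | python wc.py map | sort | python wc.py reduce` before handing it to the streaming jar through its -mapper and -reducer options. The jar location and the way the script is shipped to the nodes depend on the installation.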

Regarding Pig/Hive: these are also Java-centric systems, at a higher level. Hive can be used without any programming at all, but it can be extended using Java. Pig requires Java from the beginning. I think these systems are almost always preferable to MR jobs in the cases where they can be applied. Usually these are cases where the processing is SQL-like.

Performance considerations: streaming vs. native Java.
Streaming feeds input to the mapper via its input stream. This is interprocess communication, which is inherently less efficient than the in-process data passing between the record reader and the mapper in the Java case.
I can draw the following conclusions from the above: a) For light processing (like looking for a substring, counting, ...) this overhead can be significant, and a Java solution will be more efficient.
b) For heavy processing that can potentially be implemented more efficiently in some non-Java language, a streaming-based solution can have some edge.
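A rough way to get a feel for the interprocess cost is to run the same light operation (a substring filter) once in-process and once through a pipe to a child process. The sketch below is only an analogy in Python, not a measurement of Hadoop itself: the needle string, the record layout, and the function names are all assumptions, and the piped timing also includes process startup, which a streaming task pays once per task rather than once per record.

```python
import subprocess
import sys
import time

NEEDLE = "error"  # hypothetical substring to look for

# Source of the child process: it reads lines on stdin and echoes matches.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    f"    if {NEEDLE!r} in line:\n"
    "        sys.stdout.write(line)\n"
)


def filter_in_process(lines):
    """Apply the filter directly, with no serialization or pipes."""
    return [line for line in lines if NEEDLE in line]


def filter_via_pipe(lines):
    """Send every line to a child process over stdin and read matches back."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input="".join(lines),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines(keepends=True)


def timed(fn, lines):
    """Return (output, elapsed seconds) for one call of fn."""
    start = time.perf_counter()
    out = fn(lines)
    return out, time.perf_counter() - start


if __name__ == "__main__":
    data = [
        f"record {i} {'error' if i % 10 == 0 else 'ok'}\n"
        for i in range(200_000)
    ]
    direct, t_direct = timed(filter_in_process, data)
    piped, t_piped = timed(filter_via_pipe, data)
    print(f"in-process: {len(direct)} matches in {t_direct:.3f}s")
    print(f"via pipe:   {len(piped)} matches in {t_piped:.3f}s")
```

Both paths return the same lines; only the cost of getting the records to the filter differs. The heavier the per-record work becomes, the smaller the share of the total time that the pipe accounts for.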

Pig/Hive performance considerations.
Pig and Hive both implement the primitives of SQL processing. In other words, they implement the elements of an execution plan in the RDBMS world. These implementations are good and well tuned. At the same time, Hive (which I know better) is an interpreter. It does not do code generation; it interprets the execution plan within pre-built MR job(s). This means that if you have some complex conditions and write code specifically for them, that code has every chance of doing much better than Hive, which reflects the performance advantage of a compiler over an interpreter.
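The interpreter vs. compiler point can be illustrated with a toy example. The sketch below is not how Hive is implemented; the plan encoding, the column names, and the functions are all invented for illustration. It evaluates a filter condition by walking an expression tree for every row, and compares that with a hand-written function for the same condition.

```python
import operator

# Comparison operators the toy interpreter understands.
OPS = {">": operator.gt, "==": operator.eq}

# Hypothetical "execution plan" for: age > 30 AND country = 'US'
PLAN = (
    "and",
    (">", ("col", "age"), ("lit", 30)),
    ("==", ("col", "country"), ("lit", "US")),
)


def evaluate(node, row):
    """Interpret the plan: walk the expression tree again for every row."""
    kind = node[0]
    if kind == "col":
        return row[node[1]]
    if kind == "lit":
        return node[1]
    if kind == "and":
        return evaluate(node[1], row) and evaluate(node[2], row)
    return OPS[kind](evaluate(node[1], row), evaluate(node[2], row))


def handwritten(row):
    """The same condition written directly, as custom job code would do."""
    return row["age"] > 30 and row["country"] == "US"


if __name__ == "__main__":
    rows = [
        {"age": 42, "country": "US"},
        {"age": 25, "country": "US"},
        {"age": 50, "country": "DE"},
    ]
    for row in rows:
        print(row, evaluate(PLAN, row), handwritten(row))
```

Both versions give the same answers, but the interpreted one pays for tuple lookups, dispatch, and recursion on every row, while the hand-written one does only the two comparisons. That per-row bookkeeping is the kind of cost the answer attributes to interpreting a plan.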

128

answered Oct 14 '22 05:10

David Gruzman

Tags: hadoop, apache-pig, mapreduce
