Difference between combiner and partitioner

I am new to MapReduce and can't figure out the difference between the partitioner and the combiner. I know both run in the intermediate step between the map and reduce tasks, and as far as I understand both reduce the amount of data to be processed by the reduce task. Please explain the difference using an example.

asked Jul 25 '16 08:07

harshit

2 Answers

First of all, I agree with @Binary nerd's comment.

Combiners can be viewed as mini-reducers in the map phase. They perform a local reduce on the mapper results before those results are distributed further. Once the combiner has run, its output is passed on to the reducer for further work.

The partitioner, on the other hand, comes into the picture when we are working with more than one reducer. The partitioner decides which reducer is responsible for a particular key. It takes the mapper result (or the combiner result, if a combiner is used) and sends it to the responsible reducer based on the key.

Combiner and partitioner scenario: [image not available]

Partitioner-only scenario: [image not available]

Examples :

Combiner Example
Partitioner Example :

The partitioning phase takes place after the map phase and before the reduce phase. The number of partitions is equal to the number of reducers. The data is partitioned across the reducers according to the partitioning function. The difference between a partitioner and a combiner is that the partitioner divides the data according to the number of reducers, so that all the data in a single partition is processed by a single reducer. The combiner, however, works like the reducer and processes the data within each partition; it is an optimization for the reducer. The default partitioning function is hash partitioning, where the hash is computed on the key. However, it may be useful to partition the data according to some other function of the key or the value. -- Source
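To make the partitioning step concrete, here is a minimal plain-Java sketch that does not use the Hadoop API. `hashPartition` mirrors the arithmetic of Hadoop's default `HashPartitioner` (hash of the key, sign bit masked off, modulo the number of reducers). The class name `PartitionSketch` and the `firstLetterPartition` rule are hypothetical, made up only to illustrate "some other function of the key".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionSketch {

    // Same arithmetic as the default hash partitioning: mask off the sign bit
    // so the result is non-negative, then take it modulo the reducer count.
    static int hashPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Hypothetical custom rule (an assumption for illustration): keys starting
    // with a-m go to partition 0, everything else to partition 1.
    static int firstLetterPartition(String key, int numReducers) {
        char c = Character.toLowerCase(key.charAt(0));
        int bucket = (c >= 'a' && c <= 'm') ? 0 : 1;
        return bucket % numReducers; // stays in range even with a single reducer
    }

    // Group keys by the partition the custom rule assigns them to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReducers) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            int partition = firstLetterPartition(key, numReducers);
            byReducer.computeIfAbsent(partition, p -> new ArrayList<>()).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("hello", "hello", "there", "howdy", "howdy", "again");
        for (String key : keys) {
            // The same key always maps to the same reducer.
            System.out.println(key + " -> reducer " + hashPartition(key, 2));
        }
        // prints {0=[hello, hello, howdy, howdy, again], 1=[there]}
        System.out.println(assign(keys, 2));
    }
}
```

Whichever rule is used, the point is the same: the partitioner only routes each key to a reducer, it does not aggregate anything.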

answered Nov 03 '22 00:11

Ram Ghadiyaram

I think a little example can explain this very clearly and quickly.

Let's say you have a MapReduce Word Count job with 2 mappers and 1 reducer.

Without a Combiner

"hello hello there" => mapper1 => (hello, 1), (hello, 1), (there, 1)

"howdy howdy again" => mapper2 => (howdy, 1), (howdy, 1), (again, 1)

Both outputs get to the reducer => (again, 1), (hello, 2), (howdy, 2), (there, 1)

Using the Reducer as the Combiner

"hello hello there" => mapper1 with combiner => (hello, 2), (there, 1)

"howdy howdy again" => mapper2 with combiner => (howdy, 2), (again, 1)

Both outputs get to the reducer => (again, 1), (hello, 2), (howdy, 2), (there, 1)

Conclusion

The end result is the same, but when using a combiner, the map output is already reduced before it leaves the mapper. In this example each mapper sends only 2 output pairs to the reducer instead of 3 (4 pairs in total instead of 6). So less data is written to disk and moved across the network. This is useful when aggregating values.
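The two runs above can be reproduced with a small plain-Java simulation. This is only a sketch of the data flow, not Hadoop code: the class `CombinerSketch` and its methods `map`, `sumByKey` and `run` are hypothetical names, and the "shuffle" is just a list that collects what each mapper would send to the single reducer.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {

    // Outcome of one simulated job: pairs sent to the reducer and final counts.
    static final class Result {
        final int shuffledPairs;
        final Map<String, Integer> counts;

        Result(int shuffledPairs, Map<String, Integer> counts) {
            this.shuffledPairs = shuffledPairs;
            this.counts = counts;
        }
    }

    // Map step: emit (word, 1) for every word in the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Sum the values per key. Used both as the combiner and as the reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // One input line per mapper; a single reducer receives everything.
    static Result run(List<String> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapOutput = map(split);
            if (useCombiner) {
                // Local reduce on this mapper's output before it is sent on.
                shuffled.addAll(sumByKey(mapOutput).entrySet());
            } else {
                shuffled.addAll(mapOutput);
            }
        }
        return new Result(shuffled.size(), sumByKey(shuffled));
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("hello hello there", "howdy howdy again");
        Result plain = run(splits, false);
        Result combined = run(splits, true);
        // prints: 6 pairs -> {again=1, hello=2, howdy=2, there=1}
        System.out.println(plain.shuffledPairs + " pairs -> " + plain.counts);
        // prints: 4 pairs -> {again=1, hello=2, howdy=2, there=1}
        System.out.println(combined.shuffledPairs + " pairs -> " + combined.counts);
    }
}
```

The counts are identical in both runs; only the number of pairs handed to the reducer changes.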

The Combiner is actually a Reducer applied to the map() outputs.

If you take a look at the very first Apache MapReduce tutorial, which is the same word count example illustrated above, you can see that it uses the reducer as the combiner:

job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
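Reusing the reducer as the combiner works for word count because addition is associative and commutative, so partial sums can be summed again without changing the result. The plain-Java sketch below (hypothetical class `CombinerSafetySketch`, made-up values, no Hadoop API) contrasts that with an average, where applying the same function first per mapper and then again in the reducer gives a different answer.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSafetySketch {

    static double sum(List<Double> values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    static double mean(List<Double> values) {
        return sum(values) / values.size();
    }

    public static void main(String[] args) {
        // Values for one key, spread over two mappers (made-up numbers).
        List<Double> mapper1 = Arrays.asList(1.0, 2.0, 3.0);
        List<Double> mapper2 = Arrays.asList(10.0);
        List<Double> all = Arrays.asList(1.0, 2.0, 3.0, 10.0);

        // Sum: combining per mapper first gives the same result as reducing everything.
        double directSum = sum(all);
        double combinedSum = sum(Arrays.asList(sum(mapper1), sum(mapper2)));
        System.out.println(directSum + " vs " + combinedSum);   // prints 16.0 vs 16.0

        // Mean: the mean of per-mapper means is not the overall mean.
        double directMean = mean(all);
        double combinedMean = mean(Arrays.asList(mean(mapper1), mean(mapper2)));
        System.out.println(directMean + " vs " + combinedMean); // prints 4.0 vs 6.0
    }
}
```

For a job like an average, a common workaround is to have the combiner emit partial (sum, count) pairs and let the reducer do the final division.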

answered Nov 03 '22 01:11

Nicomak

Tags:

hadoop

mapreduce

partitioner