Hadoop combiner sort phase

When running a MapReduce job with a specified combiner, is the combiner run during the sort phase? I understand that the combiner is run on mapper output for each spill, but it seems like it would also be beneficial to run during intermediate steps when merge sorting. I'm assuming here that in some stages of the sort, mapper output for some equivalent keys is held in memory at some point.

If this doesn't currently happen, is there a particular reason, or just something which hasn't been implemented?

Thanks in advance!

asked Oct 19 '11 18:10

Michael Mior

2 Answers

Combiners are there to save network bandwidth.

The map output is sorted directly:

    sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);

This happens right after the actual mapping is done. While iterating through the buffer, it checks whether a combiner has been set; if so, it combines the records before they are written. If not, it spills them directly to disk.

The important parts are in the MapTask class, if you'd like to see them for yourself.

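    // Sort the in-memory buffer before spilling.
    sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
    // ... (some fields elided)
    for (int i = 0; i < partitions; ++i) {
        if (combinerRunner == null) {
            // No combiner configured: spill the sorted records directly.
        } else {
            // Combiner configured: combine this partition's records before writing them.
            combinerRunner.combine(kvIter, combineCollector);
        }
    }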

This is the right stage to save disk space and network bandwidth, because it is very likely that the output has to be transferred. Running the combiner during the merge/shuffle/sort phase is less beneficial, because by then you have to crunch more data than when the combiner runs as the map finishes.
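To make the per-spill step concrete, here is a minimal sketch in plain Java. It is not Hadoop code: the class SpillCombineSketch and the method sortAndCombine are hypothetical names, and the combiner is assumed to be a simple sum (as in word count). It only illustrates "sort the buffer, then combine records with equal keys before they are spilled".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: SpillCombineSketch and sortAndCombine are hypothetical names,
// not part of the Hadoop API. The combiner is assumed to be a sum.
class SpillCombineSketch {

    // Sorts buffered (key, count) records by key, then sums adjacent records sharing a key.
    static List<Map.Entry<String, Integer>> sortAndCombine(List<Map.Entry<String, Integer>> buffer) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(buffer);
        sorted.sort(Map.Entry.comparingByKey());

        List<Map.Entry<String, Integer>> spill = new ArrayList<>();
        for (Map.Entry<String, Integer> record : sorted) {
            int last = spill.size() - 1;
            if (last >= 0 && spill.get(last).getKey().equals(record.getKey())) {
                // Same key as the previous record: fold it into the running sum.
                int sum = spill.get(last).getValue() + record.getValue();
                spill.set(last, Map.entry(record.getKey(), sum));
            } else {
                // New key: start a new output record.
                spill.add(Map.entry(record.getKey(), record.getValue()));
            }
        }
        return spill;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> buffer = List.of(
            Map.entry("b", 1), Map.entry("a", 1), Map.entry("b", 1),
            Map.entry("a", 1), Map.entry("c", 1));
        List<Map.Entry<String, Integer>> spill = sortAndCombine(buffer);
        // Five buffered records shrink to three spilled records.
        System.out.println(buffer.size() + " records in, " + spill.size() + " records spilled");
    }
}
```

Because the records are already sorted when the combiner sees them, equal keys are adjacent and a single pass is enough, which is why combining at spill time is cheap.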

Note that the "sort" phase shown in the web interface is misleadingly named. It is purely merging.

answered Oct 24 '22 16:10

Thomas Jungblut

There are two opportunities for running the Combiner, both on the map side of processing. (A very good online reference is from Tom White's "Hadoop: The Definitive Guide" - https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort )

The first opportunity comes on the map side after completing the in-memory sort by key of each partition, and before writing those sorted data to disk. The motivation for running the Combiner at this point is to reduce the amount of data ultimately written to local storage. By running the Combiner here, we also reduce the amount of data that will need to be merged and sorted in the next step. So to the original question posted, yes, the Combiner is already being applied at this early step.

The second opportunity comes right after merging and sorting the spill files. In this case, the motivation for running the Combiner is to reduce the amount of data ultimately sent over the network to the reducers. This stage benefits from the earlier application of the Combiner, which may have already reduced the amount of data to be processed by this step.
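The second opportunity can be sketched in the same spirit. The following is plain Java, not Hadoop code: MergeCombineSketch and mergeAndCombine are hypothetical names, the combiner is again assumed to be a sum, and a TreeMap stands in for the streaming merge over sorted spill files that Hadoop actually performs.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: MergeCombineSketch and mergeAndCombine are hypothetical names.
// A TreeMap replaces the real on-disk merge of sorted spill files.
class MergeCombineSketch {

    // Merges spills (each already combined on its own) into one key-sorted output,
    // summing the values of keys that occur in more than one spill.
    static TreeMap<String, Integer> mergeAndCombine(List<Map<String, Integer>> spills) {
        TreeMap<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> spill : spills) {
            for (Map.Entry<String, Integer> record : spill.entrySet()) {
                merged.merge(record.getKey(), record.getValue(), Integer::sum);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> spills = List.of(
            Map.of("a", 2, "b", 1),
            Map.of("a", 1, "c", 4),
            Map.of("b", 3, "c", 1));
        // Six records across three spills become three records for the reducers.
        System.out.println(mergeAndCombine(spills)); // prints {a=3, b=4, c=5}
    }
}
```

Each spill here has one record per key because the first combiner pass already ran; the second pass only has to reconcile keys that appear in several spills. As far as I know, Hadoop applies this second pass only when there are enough spill files to make it worthwhile, controlled by a configurable threshold (min.num.spills.for.combine in older releases).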

answered Oct 24 '22 15:10

user3344305


Tags: hadoop, mapreduce, combiners
