 

Where do combiners combine mapper outputs: in the map phase or the reduce phase of a MapReduce job?

I was under the impression that a combiner is just like a reducer that acts on the output of a local map task; that is, it aggregates the results of an individual map task in order to reduce the network bandwidth needed for output transfer.

And from reading Hadoop: The Definitive Guide, 3rd edition, my understanding seems correct.

From chapter 2 (page 34)

Combiner Functions: Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
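Since addition is associative and commutative, a word-count sum reducer satisfies that contract and can double as the combiner. A minimal sketch of such a reducer, assuming the org.apache.hadoop.mapreduce API (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Summing is associative and commutative, so running this class zero,
// one, or many times over partial map output leaves the final reduce
// result unchanged: exactly the contract the book describes.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}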

So I tried the following on the wordcount problem:

job.setMapperClass(mapperClass);
job.setCombinerClass(reduceClass);  // the reducer doubles as the combiner
job.setNumReduceTasks(0);           // map-only job: the reduce phase is disabled

Here are the counters:

14/07/18 10:40:15 INFO mapred.JobClient: Counters: 10
14/07/18 10:40:15 INFO mapred.JobClient:   File System Counters
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of bytes read=293
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of bytes written=75964
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of read operations=0
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of write operations=0
14/07/18 10:40:15 INFO mapred.JobClient:   Map-Reduce Framework
14/07/18 10:40:15 INFO mapred.JobClient:     Map input records=7
14/07/18 10:40:15 INFO mapred.JobClient:     Map output records=16
14/07/18 10:40:15 INFO mapred.JobClient:     Input split bytes=125
14/07/18 10:40:15 INFO mapred.JobClient:     Spilled Records=0
14/07/18 10:40:15 INFO mapred.JobClient:     Total committed heap usage (bytes)=85000192
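(Note that this counter list contains no "Combine input records" or "Combine output records" entries, which is consistent with the combiner never running.)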

And here is part-m-00000:

hello   1
world   1
Hadoop  1
programming 1
mapreduce   1
wordcount   1
lets    1
see 1
if  1
this    1
works   1
12345678    1
hello   1
world   1
mapreduce   1
wordcount   1

So clearly no combiner was applied. I understand that Hadoop does not guarantee that a combiner will be called at all. But when I turn on the reduce phase, the combiner does get called.

Why does it behave this way?

Now, reading chapter 6 (page 208) on how MapReduce works, I see this paragraph in the description of the reduce side:

The map outputs are copied to the reduce task JVM’s memory if they are small enough (the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.
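For reference, these thresholds live in the job configuration. A sketch using the old MRv1 property names from the quote; the values shown are believed to be the defaults, so treat them as assumptions to check against your Hadoop version:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Proportion of the reduce task's heap used to buffer copied map outputs.
conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
// Fill level of that buffer at which an in-memory merge and spill begins.
conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
// Number of buffered map outputs that also triggers a merge.
conf.setInt("mapred.inmem.merge.threshold", 1000);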

My inference from this paragraph is: 1) the combiner is ALSO run during the reduce phase.

asked Jul 18 '14 by brain storm



1 Answer

The main function of a combiner is optimization. In most cases it acts like a mini-reducer. From page 206 of the same book, in the chapter "How MapReduce Works" (The Map Side):

Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

The quote from your question,

If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.

Both quotes indicate that a combiner is run primarily for compactness. Reducing the network bandwidth needed for output transfer is an advantage of this optimization.
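As a concrete illustration with the word-count data above: the map task emits (hello, 1) twice, and a sum combiner running on a spill collapses those into a single (hello, 2) record, so less data is written to local disk and shuffled for that key.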

Also, from the same book,

Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, then the potential reduction in map output size is not worth the overhead in invoking the combiner, so it is not run again for this map output.

Meaning that Hadoop doesn't guarantee how many times a combiner is run (it could be zero as well).

A combiner is never run for map-only jobs. This makes sense, because a combiner changes the map output, and in a map-only job the map output is the final output. Since Hadoop doesn't guarantee the number of times a combiner is called, the job's output would not be guaranteed to be the same either.
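For contrast, here is a sketch of the word-count driver with the reduce phase enabled, the configuration in which the combiner becomes eligible to run during map-side spills and the reduce-side merge (the driver boilerplate and class names are illustrative, not taken from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // emits (word, 1) pairs
        job.setCombinerClass(IntSumReducer.class);   // may run on map-side spills; no guarantee
        job.setReducerClass(IntSumReducer.class);    // final aggregation
        job.setNumReduceTasks(1);                    // reduce phase on, so the combiner is eligible
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With this setup, the "Combine input records" and "Combine output records" counters in the job output show whether, and on how many records, the combiner actually ran.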

answered Oct 13 '22 by Phani Rahul