 

Where do combiners combine mapper outputs: in the map phase or the reduce phase of a MapReduce job?

I was under the impression that a combiner is just like a reducer that acts on the output of a local map task; that is, it aggregates the results of an individual map task in order to reduce the network bandwidth needed for output transfer.

And from reading Hadoop: The Definitive Guide, 3rd edition, my understanding seems correct.

From chapter 2 (page 34)

Combiner Functions: Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
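Since addition is associative and commutative, a word-count sum reducer satisfies that contract and can double as the combiner. A minimal sketch of such a reducer, assuming the org.apache.hadoop.mapreduce API (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Summing is associative and commutative, so running this class zero,
// one, or many times over partial map output leaves the final reduce
// result unchanged: exactly the contract the book describes.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}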

So I tried the following on the wordcount problem:

job.setMapperClass(mapperClass);
job.setCombinerClass(reduceClass);  // the reducer doubles as the combiner
job.setNumReduceTasks(0);           // map-only job: the reduce phase is disabled

Here are the counters:

14/07/18 10:40:15 INFO mapred.JobClient: Counters: 10
14/07/18 10:40:15 INFO mapred.JobClient:   File System Counters
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of bytes read=293
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of bytes written=75964
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of read operations=0
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of write operations=0
14/07/18 10:40:15 INFO mapred.JobClient:   Map-Reduce Framework
14/07/18 10:40:15 INFO mapred.JobClient:     Map input records=7
14/07/18 10:40:15 INFO mapred.JobClient:     Map output records=16
14/07/18 10:40:15 INFO mapred.JobClient:     Input split bytes=125
14/07/18 10:40:15 INFO mapred.JobClient:     Spilled Records=0
14/07/18 10:40:15 INFO mapred.JobClient:     Total committed heap usage (bytes)=85000192
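(Note that this counter list contains no "Combine input records" or "Combine output records" entries, which is consistent with the combiner never running.)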

And here is part-m-00000:

hello   1
world   1
Hadoop  1
programming 1
mapreduce   1
wordcount   1
lets    1
see 1
if  1
this    1
works   1
12345678    1
hello   1
world   1
mapreduce   1
wordcount   1

So clearly no combiner was applied. I understand that Hadoop does not guarantee that a combiner will be called at all. But when I turn on the reduce phase, the combiner does get called.

Why does it behave this way?

Now, reading chapter 6 (page 208) on how MapReduce works, I see this paragraph in the description of the reduce side:

The map outputs are copied to the reduce task JVM’s memory if they are small enough (the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.
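For reference, these thresholds live in the job configuration. A sketch using the old MRv1 property names from the quote; the values shown are believed to be the defaults, so treat them as assumptions to check against your Hadoop version:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Proportion of the reduce task's heap used to buffer copied map outputs.
conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
// Fill level of that buffer at which an in-memory merge and spill begins.
conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
// Number of buffered map outputs that also triggers a merge.
conf.setInt("mapred.inmem.merge.threshold", 1000);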

My inference from this paragraph is: 1) the combiner is ALSO run during the reduce phase.

asked Jul 18 '14 by brain storm



1 Answer

The main function of a combiner is optimization. In most cases it acts like a mini-reducer. From page 206 of the same book, in the chapter "How MapReduce Works" (The Map Side):

Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

The quote from your question,

If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.

Both quotes indicate that a combiner is run primarily for compactness. Reducing the network bandwidth needed for output transfer is an advantage of this optimization.
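As a concrete illustration with the word-count data above: the map task emits (hello, 1) twice, and a sum combiner running on a spill collapses those into a single (hello, 2) record, so less data is written to local disk and shuffled for that key.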

Also, from the same book,

Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, then the potential reduction in map output size is not worth the overhead in invoking the combiner, so it is not run again for this map output.

Meaning that Hadoop doesn't guarantee how many times a combiner is run (it could be zero as well).

A combiner is never run for map-only jobs. This makes sense, because a combiner changes the map output, and in a map-only job the map output is the final output. Since Hadoop doesn't guarantee the number of times a combiner is called, the job's output would not be guaranteed to be the same either.
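For contrast, here is a sketch of the word-count driver with the reduce phase enabled, the configuration in which the combiner becomes eligible to run during map-side spills and the reduce-side merge (the driver boilerplate and class names are illustrative, not taken from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // emits (word, 1) pairs
        job.setCombinerClass(IntSumReducer.class);   // may run on map-side spills; no guarantee
        job.setReducerClass(IntSumReducer.class);    // final aggregation
        job.setNumReduceTasks(1);                    // reduce phase on, so the combiner is eligible
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With this setup, the "Combine input records" and "Combine output records" counters in the job output show whether, and on how many records, the combiner actually ran.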

answered Oct 13 '22 by Phani Rahul