 

Hadoop MapReduce: Clarification on number of reducers

In the MapReduce framework, one reducer is used for each key generated by the mapper.

So you would think that specifying the number of Reducers in Hadoop MapReduce wouldn't make any sense because it's dependent on the program. However, Hadoop allows you to specify the number of reducers to use (-D mapred.reduce.tasks=# of reducers).

What does this mean? Does the parameter specify how many machine resources go to the reducers, rather than the actual number of reducers used?
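For illustration, here is a minimal driver sketch (the class and job names are hypothetical) showing the same property being set in code rather than via the -D flag:

    // Minimal driver fragment (names are hypothetical). Setting the property in the
    // configuration has the same effect as -D mapred.reduce.tasks=10 on the command
    // line; newer releases call the same property mapreduce.job.reduces.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("mapred.reduce.tasks", 10);   // ask the framework for 10 reduce tasks
            Job job = Job.getInstance(conf, "reducer count example");
            // ... set mapper, reducer, input/output paths, then submit the job
        }
    }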

asked Mar 12 '14 by Bryan


People also ask

How do you decide the number of reducers in MapReduce?

1) The number of reducers is the same as the number of partitions. 2) The number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).

How many reducers should I use in Hadoop?

The right number of reducers seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>). With 0.95, all of the reducers can launch immediately and start transferring map outputs as the maps finish.
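As a purely illustrative calculation (the cluster size here is hypothetical): with 10 nodes and 8 containers per node, 0.95 * 10 * 8 = 76 reducers lets every reduce task launch in a single wave, while 1.75 * 10 * 8 = 140 reducers gives the faster nodes a second, smaller wave of reduces, which tends to balance the load better.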

Can we have multiple reducers in MapReduce?

If there are a lot of key-value pairs to merge, a single reducer might take too much time. To avoid the reducer machine becoming the bottleneck, we use multiple reducers. When you have multiple reducers, each node that is running a mapper puts its key-value pairs into multiple buckets just after sorting.
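Those buckets are chosen by the partitioner. Here is a simplified restatement of what Hadoop's default hash partitioning does (a sketch, not the library source):

    // Each map-side key is hashed into one of numReduceTasks buckets, so all
    // records sharing a key end up at the same reducer task.
    import org.apache.hadoop.mapreduce.Partitioner;

    public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }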

What is the minimum number of reducers you can have in a MapReduce program?

The minimum number of reducers in a MapReduce program is zero. We can set the number of reducers to 0 in Hadoop, and it is a valid configuration (see the sketch below).
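For illustration, a minimal map-only driver fragment (names are hypothetical) that sets zero reducers through the Job API:

    // With zero reduce tasks the job is map-only: mapper output is written straight
    // to the output path and the shuffle/sort phase is skipped.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MapOnlyDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map only example");
            job.setNumReduceTasks(0);   // valid: no reducers will run
            // ... set mapper, input/output paths, then job.waitForCompletion(true)
        }
    }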


2 Answers

one reducer is used for each key generated by the mapper

This comment is not correct. One call to the reduce() method is done for each key grouped by the grouping comparator. A reducer (task) is a process that handles zero or more calls to reduce(). The property to which you refer is talking about the number of reducer tasks.
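To make the distinction concrete, here is a generic word-count-style reducer sketch (not code from the question): the framework calls reduce() once per key group, and a single reducer task makes many such calls across all the keys assigned to it.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // One reducer *task* runs this class; reduce() is invoked once for each key
    // group routed to that task, so one task typically handles many keys.
    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                       // fold all values grouped under this key
            }
            context.write(key, new IntWritable(sum)); // one output record per key group
        }
    }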

answered Sep 17 '22 by Judge Mental


To simplify @Judge Mental's (very accurate) answer a little bit: a reducer task can work on many keys at a time, but the mapred.reduce.tasks=# parameter declares how many reducer tasks will be used for a specific job.

An example if your mapred.reduce.tasks=10:
You have 2,000 keys, each key with 50 values (for an evenly distributed 100,000 k:v pairs). Each reducer should be handling roughly 200 keys (10,000 k:v pairs).

An example if your mapred.reduce.tasks=20:
You have 2,000 keys, each key with 50 values (for an evenly distributed 100,000 k:v pairs). Each reducer should be handling roughly 100 keys (5,000 k:v pairs).

In the example above, the fewer keys each reducer has to work with, the faster the overall job will be ... so long as you have the available reducer resources in the cluster, of course.

answered Sep 18 '22 by JamCon