As given by Hadoop wiki to calculate ideal number of reducers is 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum)
but when to choose 0.95 and when 1.75? what is factor that considered while deciding this multiplier?
Let's say that you have 100 reduce slots available in your cluster.
With a load factor of 0.95 all the 95 reduce tasks will start at the same time, since there are enough reduce slots available for all the tasks. This means that no tasks will be waiting in the queue, until one of the rest finishes. I would recommend this option when the reduce tasks are "small", i.e., finish relatively fast, or they all require the same time, more or less.
On the other hand, with a load factor of 1.75, 100 reduce tasks will start at the same time, as many as the reduce slots available, and the 75 rest will be waiting in the queue, until a reduce slot becomes available. This offers better load balancing, since if some tasks are "heavier" than others, i.e., require more time, then they will not be the bottleneck of the job, since the other reduce slots, instead of finishing their tasks and waiting, will now be executing the tasks in the queue. This also lightens the load of each reduce task, since the data of the map output is spread to more tasks.
If I may express my opinion, I am not sure if these factors are ideal always. Often, I use a factor greater than 1.75 (sometimes even 4 or 5), since I am dealing with Big Data, and my data does not fit in each machine, unless I set this factor higher and load balancing is also better.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With