I am just trying to confirm my understanding of difference between 0 reducer and identity reducer.
If we set the number of Reducer to 0 (by setting job. setNumreduceTasks(0)), then no reducer will execute and no aggregation will take place. In such case, we will prefer “Map-only job” in Hadoop. In Map-Only job, the map does all task with its InputSplit and the reducer do no job.
Identity Reducer is the default reducer in Hadoop old API. When no reducer class is set by job. setReducerClass() method in Driver class, Identity reducer is used as the default reducer. It doesn't provide any processing on the input, it will flush whatever input key value pair is fed to it as output.
Yes, we can set the Number of Reducer to zero. This means it is map only. The data is not sorted and directly stored in HDFS. If we want the output from mapper to be sorted ,we can use Identity reducer.
You understanding is correct. I would define it as following: If you do not need sorting of map results - you set 0 reduced,and the job is called map only.
If you need to sort the mapping results, but do not need any aggregation - you choose identity reducer.
And to complete the picture we have a third case : we do need aggregation and, in this case we need reducer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With