I think I have a fair understanding of the MapReduce programming model in general, but even after reading the original paper and some other sources many details are unclear to me, especially regarding the partitioning of the intermediate results.
I will quickly summarize my understanding of MapReduce so far: We have a potentially very large input data set, which is automatically split up into M different pieces by the MR-Framework. For each piece, the framework schedules one map task which is executed by one of the available processors/machines in my cluster. Each of the M map tasks outputs a set of Key-Value-Pairs, which is stored locally on the same machine that executed this map task. Each machine divides its disk into R partitions and distributes its computed intermediate key value pairs based on the intermediate keys among the partitions. Then, the framework starts for each distinct intermediate key one reduce task which is again executed by any of the available machines.
Now my questions are:
Conceptually it is clear what the input and outputs of the map and reduce functions/tasks are. But I think I haven’t yet understood MapReduce on the technical level. Could somebody please help me understanding?
A partitioner works like a condition in processing an input dataset. The partition phase takes place after the Map phase and before the Reduce phase. The number of partitioners is equal to the number of reducers. That means a partitioner will divide the data according to the number of reducers.
What is Hadoop Partitioner? Partitioner in MapReduce job execution controls the partitioning of the keys of the intermediate map-outputs. With the help of hash function, key (or a subset of the key) derives the partition. The total number of partitions is equal to the number of reduce tasks.
Partitioning is used to make solving maths problems involving large numbers easier by separating them into smaller units. For example, 782 can be partitioned into: 700 + 80 + 2. It helps kids see the true value of each digit.
MapReduce determines when the job starts, how many partitions it will divide the data into. Hence the number of partitions depends on the number of reducers in the program. For example, if 20 reduce tasks are running in a program, then there will be 20 partitioners to feed the data to each of the 20 reducers.
It's also important to understand the spread and variation of your intermediate key as it is hashed and modulo'd (if using the default HashPartitioner) to determine which reduce partition should process that key. Say you had an even number of reducer tasks (10), and output keys that always hashed to an even number - then in this case the modulo of these hashs numbers and 10 will always be an even number, meaning that the odd numbered reducers would never process any data.
Addendum to what Chris said,
Basically, a partitioner class in Hadoop (e.g. Default HashPartitioner)
has to implement this function,
int getPartition(K key, V value, int numReduceTasks)
This function is responsible for returning you the partition number and you get the number of reducers you fixed when starting the job from the numReduceTasks variable, as seen for in the HashPartitioner.
Based on what integer the above function return, Hadoop selects node where the reduce task for a particular key should run.
Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With