I want to ask about Hadoop partitioner ,is it implemented within Mappers?. How to measure the performance of using the default hash partitioner - Is there better partitioner to reducing data skew?
Thanks
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job.
Advertisements. A partitioner works like a condition in processing an input dataset. The partition phase takes place after the Map phase and before the Reduce phase. The number of partitioners is equal to the number of reducers.
The default partitioner in Hadoop is the HashPartitioner which has a method called getPartition . It takes key.
The difference between a partitioner and a combiner is that the partitioner divides the data according to the number of reducers so that all the data in a single partition gets executed by a single reducer. However, the combiner functions similar to the reducer and processes the data in each partition.
Partitioner is not within Mapper.
Below is the process that happens in each Mapper -
Below is process that happens in each Reducer
Now each Reducer collects all the files from each mapper, it moves into sort/merge phase(sort is already done at mapper side) which merges all the map outputs with maintaining sort ordering.
During reduce phase reduce function is invoked for each key in the sorted output.
Below is the code, illustrating the actual process of partition of keys. getpartition() will return the partition number/reducer the particular key has to be sent to based on its hash code. Hashcode has to unique for each key and across the landscape Hashcode should be unique and same for a key. For this purpose hadoop implements its own Hashcode for its key instead of using java default hash code.
Partition keys by their hashCode().
public class HashPartitioner<K, V> extends Partitioner<K, V> {
public int getPartition(K key, V value,
int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}
Partitioner is a key component in between Mappers and Reducers. It distributes the maps emitted data among the Reducers.
Partitioner runs within every Map Task JVM (java process).
The default partitioner HashPartitioner
works based on Hash function and it is very faster compared other partitioner like TotalOrderPartitioner
. It runs hash function on every map output key i.e.:
Reduce_Number = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
To check the performance of Hash Partitioner, use Reduce task counters and see how the distribution happened among the reducers.
Hash Partitioner is basic partitioner and it doesn't suit for processing data with high skewness.
To address the data skew problems, we need to write the custom partitioner class extending Partitioner.java
class from MapReduce API.
The example for custom partitioner is like RandomPartitioner
. It is one of the best ways to distribute the skewed data among the reducers evenly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With