
Hadoop partitioner

I want to ask about the Hadoop partitioner: is it implemented within Mappers? How can I measure the performance of the default hash partitioner, and is there a better partitioner for reducing data skew?

Thanks

asked Dec 22 '14 by Nada Ghanem

People also ask

What is Hadoop partitioner?

Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job.
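For illustration, here is a minimal job-setup sketch (the class and job names are made up for this example) showing that the partition count follows directly from the configured number of reduce tasks:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class PartitionCountExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "partition-count-example");
            job.setNumReduceTasks(4);                       // map output is split into 4 partitions
            job.setPartitionerClass(HashPartitioner.class); // the default; set here only for clarity
        }
    }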

What is the MapReduce partitioner?

A partitioner works like a condition in processing an input dataset. The partition phase takes place after the Map phase and before the Reduce phase. The number of partitions is equal to the number of reducers.

What is the default partitioner in Hadoop?

The default partitioner in Hadoop is the HashPartitioner, which has a method called getPartition. It takes the key, the value, and the number of reduce tasks, and returns the partition number for that key.

What is partitioner and combiner?

The difference between a partitioner and a combiner is that the partitioner divides the data according to the number of reducers, so that all the data in a single partition is processed by a single reducer. The combiner, however, functions similarly to the reducer and pre-processes the data in each partition.


2 Answers

The Partitioner is not implemented within the Mapper class itself; it is applied to the map output inside the map task, as described below.

Below is the process that happens in each map task:

  • Each map task writes its output to a circular memory buffer (not directly to disk). When the buffer reaches a threshold, a background thread starts to spill the contents to disk. [Buffer size is governed by the mapreduce.task.io.sort.mb property and defaults to 100 MB; the spill threshold is governed by the mapreduce.map.sort.spill.percent property and defaults to 0.80, i.e. 80%.] Before spilling to disk, the data is partitioned according to the reducers it will be sent to, and an in-memory sort by key is performed within each partition.
  • The combiner function, if one is configured, is run on the output of each sort, so that less data is written to disk and transferred; this must be enabled explicitly.
  • The map output is optionally compressed [mapreduce.map.output.compress=true; mapreduce.map.output.compress.codec=CodecName] (see the configuration sketch after this list).
  • The final merged output is written to disk, and the output file's partitions are made available to the reducers over HTTP.
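As a rough illustration of the tunables mentioned above, here is a minimal job-setup sketch (the job name, the buffer size of 200 MB, and the use of Snappy and IntSumReducer are arbitrary choices for this example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class MapSideTuning {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("mapreduce.task.io.sort.mb", 200);            // spill buffer: 200 MB instead of the 100 MB default
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // start spilling at 80% full (the default)
            conf.setBoolean("mapreduce.map.output.compress", true);   // compress map output
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "map-side-tuning-example");
            job.setCombinerClass(IntSumReducer.class); // the combiner must be enabled explicitly
        }
    }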

Below is the process that happens in each Reducer:

  • Each Reducer fetches its partition of the output files from every mapper, then moves into the sort/merge phase (the sorting was already done on the mapper side), which merges all the map outputs while maintaining the sort ordering.

  • During the reduce phase, the reduce function is invoked for each key in the sorted output.


Below is the code illustrating the actual partitioning of keys. getPartition() returns the partition number (i.e., the reducer) that a particular key is sent to, based on its hash code. The hash code must be deterministic: the same key must always produce the same hash code, on every node. For this purpose Hadoop's key types implement their own hashCode() rather than relying on Java's default (identity-based) hash code.

 Partition keys by their hashCode(). 

    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask off the sign bit so the result is non-negative,
            // then take it modulo the number of reducers.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }
answered Oct 18 '22 by Karthik

Partitioner is a key component between Mappers and Reducers. It distributes the map-emitted data among the Reducers.

The Partitioner runs within every map task's JVM (Java process).

The default partitioner, HashPartitioner, works based on a hash function and is much faster than other partitioners such as TotalOrderPartitioner. It runs a hash function on every map output key, i.e.:

Reduce_Number = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

To check the performance of the HashPartitioner, use the reduce task counters (e.g., reduce input records per reduce task) and see how evenly the records were distributed among the reducers.
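As a complementary offline check, here is a minimal sketch (the sample keys and partition count are made up) that runs the HashPartitioner formula over a sample of keys and tallies how many records each partition would receive:

    import java.util.HashMap;
    import java.util.Map;

    public class SkewCheck {
        public static void main(String[] args) {
            String[] sampleKeys = {"apple", "banana", "apple", "apple", "cherry"};
            int numReduceTasks = 3;

            // Apply the HashPartitioner formula to each key and tally per-partition load.
            Map<Integer, Integer> load = new HashMap<>();
            for (String key : sampleKeys) {
                int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
                load.merge(partition, 1, Integer::sum);
            }

            // A heavily uneven partition -> count mapping signals skew before the job runs.
            System.out.println(load);
        }
    }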

The HashPartitioner is a basic partitioner and does not suit processing data with high skewness.

To address data skew problems, we need to write a custom partitioner class extending the Partitioner class from the MapReduce API.

An example of a custom partitioner is a RandomPartitioner. It is one of the best ways to distribute skewed data among the reducers evenly (see the sketch below).
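A minimal sketch of such a partitioner (this class is illustrative, not one shipped with Hadoop; note that values for the same key then land on different reducers, so a second aggregation pass is usually required):

    import java.util.Random;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class RandomPartitioner extends Partitioner<Text, IntWritable> {
        private final Random random = new Random();

        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Ignore the key entirely: spread records uniformly at random,
            // so no single hot key can overload one reducer.
            return random.nextInt(numReduceTasks);
        }
    }

It would be enabled on a job with job.setPartitionerClass(RandomPartitioner.class).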

answered Oct 18 '22 by Naga