Custom partitioner example

Tags: hadoop, mapreduce, partitioning

I am trying to write a new Hadoop job for input data that is somewhat skewed. An analogy would be the word count example in the Hadoop tutorial, except that, let's say, one particular word occurs a great many times.

I want a partition function where this one key is mapped to multiple reducers and the remaining keys are assigned by the usual hash partitioning. Is this possible?

Thanks in advance.

asked Oct 24 '11 23:10

Sainath Mallidi

1 Answer

I don't think Hadoop lets the same key be mapped to multiple reducers. However, the keys can be partitioned so that the reducers are more or less evenly loaded. To do this, sample the input data and partition the keys according to the observed distribution. Check the Yahoo paper for more details on the custom partitioner. The Yahoo sort code is in the org.apache.hadoop.examples.terasort package.
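To make the sampling idea concrete, here is a minimal plain-Java sketch of range partitioning from a key sample. It is not the TeraSort code and does not use the Hadoop API; the class name SampledRangePartitioner and its methods are made up for illustration, and it assumes the sample has at least as many distinct keys as there are partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a sampled range partitioner (not a Hadoop class).
final class SampledRangePartitioner {
    // Sorted split points; there is one fewer than the number of partitions.
    private final String[] splitPoints;

    // Picks numPartitions - 1 evenly spaced split points from a sample of keys.
    SampledRangePartitioner(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        splitPoints = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splitPoints[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }
    }

    // Keys below the first split point go to partition 0, keys from the first
    // split point up to (but excluding) the second go to partition 1, and so on.
    int getPartition(String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is not a split point.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

With a sample of the keys "a" through "h" and 4 partitions, the split points come out as "c", "e" and "g", so each partition receives two of the sampled keys. Every key still lands on exactly one partition; only the boundaries move to follow the data.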

Let's say key A has 10 rows, B has 20 rows, C has 30 rows and D has 60 rows in the input. Then keys A, B and C can be sent to reducer 1 and key D to reducer 2, so that each reducer handles 60 rows. To partition the keys this way, the input has to be sampled first to learn how the keys are distributed.
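The A/B/C/D example can be worked through with a small greedy assignment: take the keys from heaviest to lightest and put each one on the reducer with the least load so far. This is only a sketch in plain Java under my own assumptions; SkewAwareAssignment and assign are hypothetical names, the counts are taken as exact rather than sampled, and reducers are numbered from 0 here rather than from 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Hadoop class): builds a key -> reducer table
// from per-key row counts.
final class SkewAwareAssignment {
    static Map<String, Integer> assign(Map<String, Long> keyCounts, int numReducers) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(keyCounts.entrySet());
        // Heaviest key first; ties are broken by key so the result is deterministic.
        entries.sort((x, y) -> {
            int byCount = Long.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });

        long[] load = new long[numReducers];
        Map<String, Integer> assignment = new HashMap<>();
        for (Map.Entry<String, Long> e : entries) {
            // Find the reducer with the smallest load so far (lowest index on ties).
            int lightest = 0;
            for (int r = 1; r < numReducers; r++) {
                if (load[r] < load[lightest]) {
                    lightest = r;
                }
            }
            load[lightest] += e.getValue();
            assignment.put(e.getKey(), lightest);
        }
        return assignment;
    }
}
```

For the counts above and 2 reducers, D goes to reducer 0 and A, B and C go to reducer 1, giving 60 rows each. A custom partitioner could look keys up in such a table and fall back to hashing for keys that never appeared in the sample.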

Here are some more suggestions to make the job complete faster.

Specify a Combiner on the JobConf to reduce the number of key/value pairs sent to the reducers. This also reduces the network traffic between the mapper and reducer tasks. Note, however, that there is no guarantee the Hadoop framework will invoke the combiner.

Also, since the data is skewed (some keys are repeated again and again, let's say 'tools'), you might want to increase the number of reduce tasks to complete the job faster. That way, while one reducer is processing 'tools', the other keys are processed by other reducers in parallel.
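As a rough picture of what a combiner buys you in the word count case, the plain-Java sketch below sums the counts for each word on the map side before anything is shuffled. CombinerSketch is a hypothetical name and this is not Hadoop code; in a real job the combiner is a Reducer class registered with JobConf.setCombinerClass, and because it may run zero or more times it has to be safe to apply repeatedly, as a sum is.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of map-side pre-aggregation (not a Hadoop class).
final class CombinerSketch {
    // Collapses one (word, 1) record per occurrence into one (word, sum) record per word.
    static Map<String, Integer> combine(List<String> mapOutputWords) {
        Map<String, Integer> partialSums = new TreeMap<>();
        for (String word : mapOutputWords) {
            partialSums.merge(word, 1, Integer::sum);
        }
        return partialSums;
    }
}
```

If one map task emits 'tools' three times and 'hadoop' once, four records shrink to two before the shuffle. For a heavily repeated key this is where most of the saving comes from, since the hot key's reducer then receives at most one record per combiner run instead of one per occurrence.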
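To see why more reduce tasks help, it is worth looking at how the default hash partitioning picks a reducer. The sketch below mirrors the formula used by Hadoop's HashPartitioner, but on plain Java strings; HashPartitionSketch and keysSharingPartition are hypothetical names, and String.hashCode is not the same hash that Hadoop's Text type uses, so the actual partition numbers in a real job will differ.

```java
import java.util.List;

// Hypothetical illustration of hash partitioning (not a Hadoop class).
final class HashPartitionSketch {
    // Same shape as the default hash partitioning: clear the sign bit, then take the modulus.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many of the other keys end up on the same reducer as the hot key.
    static int keysSharingPartition(String hotKey, List<String> otherKeys, int numReduceTasks) {
        int hotPartition = getPartition(hotKey, numReduceTasks);
        int sharing = 0;
        for (String key : otherKeys) {
            if (getPartition(key, numReduceTasks) == hotPartition) {
                sharing++;
            }
        }
        return sharing;
    }
}
```

With the hot key "a" and other keys "b" through "h", three keys share the hot key's reducer when there are 2 reduce tasks, and none do when there are 8. The hot key itself is still handled by a single reducer, so this shortens the rest of the job rather than the slowest task. The number of reduce tasks can be set with JobConf.setNumReduceTasks.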

answered Nov 15 '22 05:11

Praveen Sripati
