I am trying to write a new Hadoop job for input data that is somewhat skewed. An analogy for this would be the word-count example in the Hadoop tutorial, except let's say one particular word is present a huge number of times.
I want to have a partition function where this one key is mapped to multiple reducers and the remaining keys are partitioned according to their usual hash partitioning. Is this possible?
Thanks in advance.
You can create a custom partitioner, for example one that sends all records with an even key to the first reducer and all records with an odd key to the second. Working through such an example will help you understand the process of implementing a custom partitioner; to execute it, you must configure the job with exactly two reduce tasks.
The default partitioner uses the hash value of the key and the total number of reduce tasks to determine the partition number, so if you change the number of reduce tasks, the same key may be sent to a different partition.
Specifically, it uses the hashCode() method of the key object modulo the total number of partitions to determine which partition a given (key, value) pair is sent to. The Partitioner interface provides the getPartition() method, which you can implement yourself if you want to declare a custom partitioning scheme for your job.
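For illustration, here is a minimal sketch of the even/odd partitioner described above, written against the old org.apache.hadoop.mapred API (the class name and key/value types are hypothetical, and the job is assumed to run with exactly two reduce tasks):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical partitioner: even keys go to reducer 0, odd keys to reducer 1.
    // Assumes the job is configured with exactly two reduce tasks.
    public class EvenOddPartitioner implements Partitioner<IntWritable, Text> {

        @Override
        public void configure(JobConf job) {
            // No job-specific configuration is needed for this simple scheme.
        }

        @Override
        public int getPartition(IntWritable key, Text value, int numPartitions) {
            return key.get() % 2 == 0 ? 0 : 1;
        }
    }

You would register it on the job with conf.setPartitionerClass(EvenOddPartitioner.class) and conf.setNumReduceTasks(2).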
In Hadoop the same key cannot be mapped to multiple reducers: the partitioner assigns each key to exactly one reduce task. The keys can, however, be partitioned so that the reducers are more or less evenly loaded. For this, the input data should be sampled and the keys partitioned appropriately. Check the Yahoo TeraSort paper for more details on the custom partitioner; the Yahoo sort code is in the org.apache.hadoop.examples.terasort package.
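TeraSort's approach can be reused directly: sample the input, write split points to a partition file, and let TotalOrderPartitioner route keys by range. Here is a minimal sketch using the old mapred API; the reduce-task count, sampling parameters, and partition-file path are illustrative assumptions, and the rest of the job setup (input format, mapper, reducer) is assumed to be done elsewhere:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.InputSampler;
    import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

    // Sketch of TeraSort-style range partitioning: sample the input keys,
    // write numReduceTasks - 1 split points, and partition by key range.
    public class SampledPartitioning {

        public static void configure(JobConf conf) throws Exception {
            conf.setNumReduceTasks(10); // must be set before sampling

            // Hypothetical location for the split-point file.
            TotalOrderPartitioner.setPartitionFile(conf, new Path("/tmp/partitions.lst"));

            // Sample ~10% of the keys, capped at 10000 samples from 10 splits.
            InputSampler.writePartitionFile(
                    conf, new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10));

            conf.setPartitionerClass(TotalOrderPartitioner.class);
        }
    }

Note that a key which is heavier than an entire range by itself will still go to a single reducer, since any partitioner maps each key to exactly one partition.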
Let's say key A has 10 rows, B has 20 rows, C has 30 rows and D has 60 rows in the input. Then keys A, B and C can be sent to reducer 1 and key D can be sent to reducer 2, making the load on the reducers evenly distributed. To partition the keys this way, input sampling has to be done first to know how the keys are distributed; a sketch of such a partitioner follows below.
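Here is a minimal sketch of a hand-tuned partitioner for that A/B/C/D example (the key names and the two-way split come straight from the example above; partitions are numbered from zero, so "reducer 1" and "reducer 2" become partitions 0 and 1; in a real job the heavy keys would be discovered by sampling rather than hard-coded):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Skew-aware partitioner for the example above: light keys A, B and C
    // share partition 0, while the heavy key D gets partition 1 to itself.
    // Assumes the job is configured with exactly two reduce tasks.
    public class SkewAwarePartitioner implements Partitioner<Text, IntWritable> {

        @Override
        public void configure(JobConf job) {
            // In a real job, the set of heavy keys could be loaded here
            // from a sampling pass instead of being hard-coded below.
        }

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return key.toString().equals("D") ? 1 : 0;
        }
    }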
Here are some more suggestions to make the job complete faster; both are shown in the driver sketch below.
Specify a combiner on the JobConf to reduce the number of keys sent to the reducer. This also reduces the network traffic between the mapper and the reducer tasks. Note, however, that there is no guarantee the combiner will be invoked by the Hadoop framework.
Also, since the data is skewed (some of the keys are repeated again and again, let's say 'tools'), you might want to increase the number of reduce tasks to complete the job faster. This ensures that while one reducer is processing 'tools', the other data is getting processed by other reducers in parallel.
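A minimal word-count driver sketch (old mapred API) showing where both settings go; WordCountMapper and WordCountReducer are hypothetical classes standing in for your own:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical word-count driver: wires in a combiner and raises the
    // number of reduce tasks so other keys proceed while 'tools' is reduced.
    public class WordCountDriver {

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(WordCountMapper.class);
            // Reuse the reducer as the combiner: summing counts is
            // associative and commutative, so map-side partial sums are safe.
            conf.setCombinerClass(WordCountReducer.class);
            conf.setReducerClass(WordCountReducer.class);

            // More reducers keep the cluster busy while one of them
            // works through the heavy key.
            conf.setNumReduceTasks(20);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }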