In Hadoop you can use the secondary-sort mechanism to sort the values before they are sent to the reducer.
The way this works in Hadoop is that you append the value you want to sort by to the key and then supply custom grouping and key comparators that hook into the sorting system. So you end up with a key that essentially consists of both the real key and the value to sort by. To make this perform well, I need a way of creating a composite key that is also easy to decompose into the separate parts needed by the grouping and key comparators.
What is the smartest way to do this? Is there an "out-of-the-box" Hadoop class that can help with this, or do I have to create a separate key class for each MapReduce step?
And how do I do this if the key itself is a composite of several parts (which are also needed separately by the partitioner)?
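To make this concrete, here is a minimal sketch of the setup as I understand it. The class names (CompositeKey, NaturalKeyPartitioner, NaturalKeyGroupingComparator) and the field types are just ones I made up for illustration, and each public class would live in its own file:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key: the natural key plus the value to sort by.
public class CompositeKey implements WritableComparable<CompositeKey> {
    private Text naturalKey = new Text();
    private LongWritable sortValue = new LongWritable();

    public Text getNaturalKey() { return naturalKey; }

    public void set(String key, long value) {
        naturalKey.set(key);
        sortValue.set(value);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        naturalKey.write(out);
        sortValue.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        naturalKey.readFields(in);
        sortValue.readFields(in);
    }

    // Full sort order: natural key first, then the value to sort by.
    @Override
    public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : sortValue.compareTo(other.sortValue);
    }
}

// Partition on the natural key only, so all records for a key
// reach the same reducer.
public class NaturalKeyPartitioner extends Partitioner<CompositeKey, Text> {
    @Override
    public int getPartition(CompositeKey key, Text value, int numPartitions) {
        return (key.getNaturalKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the natural key only, so one reduce() call sees all values
// for a key, already ordered by the full CompositeKey comparison.
public class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((CompositeKey) a).getNaturalKey()
                .compareTo(((CompositeKey) b).getNaturalKey());
    }
}
```

These get wired up in the driver with job.setPartitionerClass(NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class). My question is whether I really have to write all of this boilerplate per job.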
What do you guys recommend?
P.S. I wanted to add the tag "secondary-sort" but I don't have enough rep yet to do so.
I was running into this situation all the time and got tired of writing custom composite key classes. So I wrote a generic Tuple class, which is a list of objects that can act as a composite key. The list may contain an arbitrary number of objects of Java primitive wrapper types, and it implements WritableComparable. The source can be viewed here:
https://github.com/pranab/chombo/blob/master/src/main/java/org/chombo/util/Tuple.java
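The gist of the idea, heavily simplified (this is not the actual chombo code; the real implementation linked above handles typed fields, serialization details, and more):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Simplified sketch of a generic tuple key: an ordered list of fields,
// serialized as one Text with a separator character.
public class SimpleTuple implements WritableComparable<SimpleTuple> {
    private static final String SEP = "\u0001";
    private List<String> fields = new ArrayList<>();

    public void add(Object field) { fields.add(field.toString()); }
    public String get(int index)  { return fields.get(index); }

    @Override
    public void write(DataOutput out) throws IOException {
        new Text(String.join(SEP, fields)).write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        Text t = new Text();
        t.readFields(in);
        fields = new ArrayList<>(Arrays.asList(t.toString().split(SEP, -1)));
    }

    // Compare field by field, so one class can serve as a composite key
    // with any number of parts.
    @Override
    public int compareTo(SimpleTuple other) {
        int n = Math.min(fields.size(), other.fields.size());
        for (int i = 0; i < n; i++) {
            int cmp = fields.get(i).compareTo(other.fields.get(i));
            if (cmp != 0) return cmp;
        }
        return Integer.compare(fields.size(), other.fields.size());
    }

    @Override
    public int hashCode() { return fields.hashCode(); }

    @Override
    public boolean equals(Object o) {
        return o instanceof SimpleTuple && fields.equals(((SimpleTuple) o).fields);
    }
}
```

A grouping comparator or partitioner can then look at just the first few fields via get(), the same way you would with a hand-written composite key, so you don't need a new key class for every job.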