Can any one explain me how secondary sorting works in hadoop ?
Why must one use GroupingComparator
and how does it work in hadoop ?
I was going through the link given below and got doubt on how groupcomapator works.
Can any one explain me how grouping comparator works?
http://www.bigdataspeak.com/2013/02/hadoop-how-to-do-secondary-sort-on_25.html
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce implements sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys. Sorting methods are implemented in the mapper class itself.
Sorting in MapReduceThe keys generated by the mapper are automatically sorted by MapReduce Framework, i.e. Before starting of reducer, all intermediate key-value pairs in MapReduce that are generated by mapper get sorted by key and not by value. Values passed to each reducer are not sorted; they can be in any order.
Secondary sorting means sorting to be done on two or more field values of the same or different data types. Additionally we might also have deal with grouping and partitioning. The best and most efficient way to do secondary sorting in Hadoop is by writing our own key class.
MapReduce is a Hadoop framework used for writing applications that can process vast amounts of data on large clusters. It can also be called a programming model in which we can process large datasets across computer clusters. This application allows data to be stored in a distributed form.
I find it easy to understand certain concepts with help of diagrams and this is certainly one of them.
Lets assume that our secondary sorting is on a composite key made out of Last Name and First Name.
With the composite key out of the way, now lets look at the secondary sorting mechanism
The partitioner and the group comparator use only natural key, the partitioner uses it to channel all records with the same natural key to a single reducer. This partitioning happens in the Map Phase, data from various Map tasks are received by reducers where they are grouped and then sent to the reduce method. This grouping is where the group comparator comes into picture, if we would not have specified a custom group comparator then Hadoop would have used the default implementation which would have considered the entire composite key, which would have lead to incorrect results.
Overview of MR steps
Grouping Comparator
Once the data reaches a reducer, all data is grouped by key. Since we have a composite key, we need to make sure records are grouped solely by the natural key. This is accomplished by writing a custom GroupPartitioner. We have a Comparator object only considering the yearMonth field of the TemperaturePair class for the purposes of grouping the records together.
public class YearMonthGroupingComparator extends WritableComparator { public YearMonthGroupingComparator() { super(TemperaturePair.class, true); } @Override public int compare(WritableComparable tp1, WritableComparable tp2) { TemperaturePair temperaturePair = (TemperaturePair) tp1; TemperaturePair temperaturePair2 = (TemperaturePair) tp2; return temperaturePair.getYearMonth().compareTo(temperaturePair2.getYearMonth()); } }
Here are the results of running our secondary sort job:
new-host-2:sbin bbejeck$ hdfs dfs -cat secondary-sort/part-r-00000
190101 -206
190102 -333
190103 -272
190104 -61
190105 -33
190106 44
190107 72
190108 44
190109 17
190110 -33
190111 -217
190112 -300
While sorting data by value may not be a common need, it’s a nice tool to have in your back pocket when needed. Also, we have been able to take a deeper look at the inner workings of Hadoop by working with custom partitioners and group partitioners. Refer this link also..What is the use of grouping comparator in hadoop map reduce
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With