Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sort reducer input iterator value before processing in Hadoop

I have some input data coming to the reducer with the value type Iterator . How can I sort this list of values to be ascending order?

I need to sort them in order since they are time values, before processing all in the reducer.

like image 949
Freddy Avatar asked Feb 22 '13 04:02

Freddy


People also ask

Is reducer output sorted?

The output of the Reducer is not re-sorted. Called once at the end of the task. This method is called once for each key.

Can sorting be done with MapReduce?

Sorting. Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce implements sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys. Sorting methods are implemented in the mapper class itself.

What is secondary sort in MapReduce?

A secondary sort problem relates to sorting values associated with a key in the reduce phase. Sometimes, it is called value-to-key conversion. The secondary sorting technique will enable us to sort the values (in ascending or descending order) passed to each reducer.

How does MapReduce sort algorithm work?

Sorting is the basic MapReduce algorithm that processes and analyzes the given data. The sorting algorithm is implemented by MapReduce to sort the output key-value pairs from the mapper with respect to their keys. Sorting methods are applied within the mapper class.


2 Answers

To achieve sorting of reducer input values using hadoop's built-in features,you can do this:

1.Modify map output key - Append map output key with the corresponding value.Emit this composite key and the value from map.Since hadoop uses entire key by default for sorting, map output records will be sorted by (your old key + value).

2.Although sorting is done in step 1, you have manipulated the map output key in the process.Hadoop does Partitioning and Grouping based on the key by default.

3.Since you have modified the original key, you need to take care of modifying Partitioner and GroupingComparator to work based on the old key i.e., only the first part of your composite key. Partitioner - decides which key-value pairs land in the same Reducer instance
GroupComparator - decides which key-value pairs among the ones that landed into the Reducer go to the same reduce method call.

4.Finally(and obviously) you need to extract the first part of input key in the reducer to get old key.

If you need more(and a better) answer, turn to Hadoop Definitive Guide 3rd Edition -> chapter 8 -> sorting -> secondary sort

like image 148
Eswara Reddy Adapa Avatar answered Oct 17 '22 07:10

Eswara Reddy Adapa


What you asked for is called Secondary Sort. In a nutshell - you extend the key to add "value sort key" to it and make hadoop to group by only "real key" but sort by both.
Here is a very good explanation about the secondary sort:
http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/

like image 32
David Gruzman Avatar answered Oct 17 '22 07:10

David Gruzman