I have input data arriving at the reducer as values of type Iterator. How can I sort this list of values in ascending order? They are time values, so I need them in order before processing them in the reducer.


Sort reducer input iterator value before processing in Hadoop

2 Answers

To sort reducer input values using Hadoop's built-in features, you can do the following:

1. Modify the map output key - append the corresponding value to the map output key, then emit this composite key together with the value from the map. Since Hadoop sorts on the entire key by default, map output records will be sorted by (your old key + value).

2. Although step 1 takes care of the sorting, you have changed the map output key in the process. By default, Hadoop also partitions and groups based on the key.

3. Since you have modified the original key, you need to supply a Partitioner and a GroupingComparator that work on the old key, i.e., only the first part of your composite key.
Partitioner - decides which key-value pairs land in the same Reducer instance.
GroupingComparator - decides which key-value pairs, among the ones that landed in the Reducer, go to the same reduce method call.

4. Finally (and obviously), you need to extract the first part of the input key in the reducer to get the old key.
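As a rough illustration of steps 1 and 3, here is a plain-Java simulation that does not use any Hadoop classes. The names CompositeKey, SORT_ORDER, partitionFor and shuffle are hypothetical, made up for this sketch. It assumes a String key and a long timestamp value, and only mimics what the framework would do: route records by the old key, then sort each partition by the whole composite key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation only; none of these types come from Hadoop.
class SecondarySortSketch {

    // Hypothetical composite key: the old ("natural") key plus the time value.
    static final class CompositeKey {
        final String naturalKey;
        final long timestamp;

        CompositeKey(String naturalKey, long timestamp) {
            this.naturalKey = naturalKey;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return naturalKey + "@" + timestamp;
        }
    }

    // Step 1: sort on the entire composite key (old key first, then timestamp).
    static final Comparator<CompositeKey> SORT_ORDER =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                      .thenComparingLong(k -> k.timestamp);

    // Step 3: partition on the old key only, so every record of one old key
    // lands in the same partition regardless of its timestamp.
    static int partitionFor(CompositeKey key, int numPartitions) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Route each record to its partition, then sort every partition.
    static List<List<CompositeKey>> shuffle(List<CompositeKey> mapOutput, int numPartitions) {
        List<List<CompositeKey>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (CompositeKey key : mapOutput) {
            partitions.get(partitionFor(key, numPartitions)).add(key);
        }
        for (List<CompositeKey> partition : partitions) {
            partition.sort(SORT_ORDER);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<CompositeKey> mapOutput = Arrays.asList(
                new CompositeKey("sensorA", 30L),
                new CompositeKey("sensorB", 10L),
                new CompositeKey("sensorA", 10L),
                new CompositeKey("sensorA", 20L));
        List<List<CompositeKey>> partitions = shuffle(mapOutput, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + ": " + partitions.get(i));
        }
    }
}
```

Partitioning on the full composite key instead would scatter the records of one old key across reducers, which is why step 3 restricts it to the first part of the key.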

If you need a more detailed (and better) answer, turn to Hadoop: The Definitive Guide, 3rd Edition -> Chapter 8 -> Sorting -> Secondary Sort.


answered Oct 17 '22 07:10

Eswara Reddy Adapa

What you asked for is called Secondary Sort. In a nutshell, you extend the key by adding a "value sort key" to it, and make Hadoop group by only the "real key" but sort by both.
Here is a very good explanation of secondary sort:
http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/
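
To make "group by only the real key but sort by both" concrete, here is a small plain-Java sketch with no Hadoop classes involved. TimedKey, SORT_BY_BOTH, GROUP_BY_REAL_KEY and reduceCalls are hypothetical names for this illustration. It assumes the values are long timestamps and emulates how one reduce call would see the values of a single real key in ascending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java emulation of "sort by both, group by real key"; not Hadoop code.
class GroupingSketch {

    // Hypothetical extended key: real key plus the "value sort key" (a time).
    static final class TimedKey {
        final String realKey;
        final long time;

        TimedKey(String realKey, long time) {
            this.realKey = realKey;
            this.time = time;
        }
    }

    // Sort order looks at both parts of the extended key.
    static final Comparator<TimedKey> SORT_BY_BOTH =
            Comparator.comparing((TimedKey k) -> k.realKey)
                      .thenComparingLong(k -> k.time);

    // Grouping looks at the real key only.
    static final Comparator<TimedKey> GROUP_BY_REAL_KEY =
            Comparator.comparing((TimedKey k) -> k.realKey);

    // Returns one list of times per emulated reduce call.
    static List<List<Long>> reduceCalls(List<TimedKey> keys) {
        List<TimedKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_BY_BOTH);

        List<List<Long>> calls = new ArrayList<>();
        TimedKey previous = null;
        for (TimedKey key : sorted) {
            // A new group starts whenever the grouping comparator sees a change.
            if (previous == null || GROUP_BY_REAL_KEY.compare(previous, key) != 0) {
                calls.add(new ArrayList<>());
            }
            calls.get(calls.size() - 1).add(key.time);
            previous = key;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<TimedKey> keys = Arrays.asList(
                new TimedKey("x", 9L), new TimedKey("y", 4L),
                new TimedKey("x", 1L), new TimedKey("y", 2L),
                new TimedKey("x", 5L));
        System.out.println(reduceCalls(keys)); // prints [[1, 5, 9], [2, 4]]
    }
}
```

In an actual job, the equivalent comparators and partitioner would be registered on the Job (for example via setSortComparatorClass, setGroupingComparatorClass and setPartitionerClass); the sketch above only mirrors the resulting behaviour.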

answered Oct 17 '22 07:10

David Gruzman



Tags:

hadoop

mapreduce

Freddy
