How to sort numerically in Hadoop's shuffle/sort phase?

Tags: sorting, hadoop

The data looks like this; the first field is a number:

3 ...
1 ...
2 ...
11 ...

I want to sort these lines numerically by the first field rather than alphabetically, so that after sorting they look like this:

1 ...
2 ...
3 ...
11 ...

But Hadoop keeps giving me this:

1 ...
11 ...
2 ...
3 ...

How do I correct it?


asked Nov 11 '12 13:11

Alcott

1 Answer

Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.

Add -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator to the streaming command.
Specify the type of sorting you need with mapred.text.key.comparator.options. Some useful options are -n (numeric sort) and -r (reverse sort).
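
To see why the default ordering puts 11 before 2, here is a small Python sketch (plain Python, not Hadoop code) that compares the same keys as text and as numbers. The key values come from the question; the function names are made up for illustration.

```python
# Illustrative only: mimics text vs. numeric key comparison outside Hadoop.
keys = ["3", "1", "2", "11"]

def sort_as_text(values):
    # Text keys are compared character by character, so "11" < "2".
    return sorted(values)

def sort_as_numbers(values):
    # Roughly what a numeric (-n) sort asks for: compare by numeric value.
    return sorted(values, key=int)

print(sort_as_text(keys))     # ['1', '11', '2', '3']
print(sort_as_numbers(keys))  # ['1', '2', '3', '11']
```

The first result matches what the question reports from Hadoop, and the second is the order a numeric comparator is meant to produce.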

EXAMPLE:

Create an identity mapper and an identity reducer with the following code.

This is mapper.py and reducer.py (both use the same code):

#!/usr/bin/env python
import sys
for line in sys.stdin:
    # Identity step: emit each input line unchanged, minus surrounding whitespace.
    print(line.strip())

This is the input.txt

This is the Streaming command

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options=-n \
-input /user/input.txt \
-output /user/output.txt \
-file ~/mapper.py \
-mapper ~/mapper.py \
-file ~/reducer.py \
-reducer ~/reducer.py

This gives you the required output.
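
If you want to check the expected ordering before submitting a job, the pipeline can be approximated locally. This is only a sketch under assumptions: the key is taken to be the text before the first tab, the sample rows are made up, and shuffle_sort_numeric is a hypothetical stand-in for the shuffle/sort phase with a numeric comparator, not Hadoop's actual implementation.

```python
def identity(lines):
    # Same job as mapper.py / reducer.py: strip each line and pass it through.
    return [line.strip() for line in lines]

def shuffle_sort_numeric(lines):
    # Stand-in for the shuffle/sort phase with a numeric comparator.
    # Assumption: the key is the text before the first tab.
    return sorted(lines, key=lambda line: int(line.split("\t", 1)[0]))

def run_local(lines):
    # map -> sort -> reduce, all in one process
    return identity(shuffle_sort_numeric(identity(lines)))

sample = ["3\tc\n", "1\ta\n", "2\tb\n", "11\td\n"]  # made-up rows
for row in run_local(sample):
    print(row)
```

The rows come out with keys in the order 1, 2, 3, 11, which is what the streaming job should produce when the numeric option takes effect.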

NOTE:

I have used a simple single-key input. If you have multiple keys and/or partitions, you will have to adjust mapred.text.key.comparator.options accordingly. Since I do not know your use case, my example is limited to this.
The identity mapper is needed because an MR job needs at least one mapper to run.
The identity reducer is needed because the shuffle/sort phase does not run in a pure map-only job.
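
For the multiple-key case mentioned in the note, the comparator options follow the style of the Unix sort command, so an option string along the lines of -k1,1n -k2,2nr would mean "first field numeric ascending, second field numeric descending". The Python sketch below only illustrates that ordering on made-up rows; it assumes tab-separated fields, and the job would also have to be configured so that the key spans both fields.

```python
rows = ["2\t10", "1\t5", "2\t30", "1\t7"]  # made-up two-field keys

def two_field_key(row):
    first, second = row.split("\t")[:2]
    # Ascending on field 1, descending on field 2 (negation flips the order).
    return (int(first), -int(second))

ordered = sorted(rows, key=two_field_key)
print(ordered)  # ['1\t7', '1\t5', '2\t30', '2\t10']
```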

answered Sep 19 '22 15:09

Nicole Hu
