Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to sort numerically in hadoop's shuffle/sort phase?

Tags:

sorting

hadoop

The data looks like this, first field is a number,

3 ...
1 ...
2 ...
11 ...

And I want to sort these lines according to the first field numerically instead of alphabetically, which means after sorting it should look like this,

1 ...
2 ...
3 ...
11 ...

But hadoop keeps giving me this,

1 ...
11 ...
2 ...
3 ...

How do correct it?

like image 443
Alcott Avatar asked Nov 11 '12 13:11

Alcott


People also ask

What happens in sort and shuffle phase?

Shuffle phase in Hadoop transfers the map output from Mapper to a Reducer in MapReduce. Sort phase in MapReduce covers the merging and sorting of map outputs. Data from the mapper are grouped by the key, split among reducers, and sorted by the key. Every reducer obtains all values associated with the same key.

How do you sort shuffle?

Definition: A distribution sort algorithm that begins by removing the first 1/8 of the n items, sorting them (recursively), and putting them in an array. This creates n/8 buckets to which the remaining 7/8 of the items are distributed. Each bucket is then sorted, and the buckets are concatenated.

Does shuffle and sort occur simultaneously?

Shuffle and sort phase in Hadoop occur simultaneously and are done by the MapReduce framework.

How can the shuffle and sort performed in reduce side explain?

Shuffling in MapReduceIt is also the process by which the system performs the sort. Then it transfers the map output to the reducer as input. This is the reason shuffle phase is necessary for the reducers. Otherwise, they would not have any input (or input from every mapper).


1 Answers

Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.

  1. -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator should be added to streaming command

  2. You need to provide type of sorting required using mapred.text.key.comparator.options. Some useful ones are -n : numeric sort, -r : reverse sort

EXAMPLE :

Create an identity mapper and reducer with the following code

This is the mapper.py & reducer.py

#!/usr/bin/env python
import sys
for line in sys.stdin:    
    print "%s" % (line.strip())

This is the input.txt

1
11
2
20
7
3
40

This is the Streaming command

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar 
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator 
-D  mapred.text.key.comparator.options=-n 
-input /user/input.txt 
-output /user/output.txt 
-file ~/mapper.py 
-mapper ~/mapper.py 
-file ~/reducer.py 
-reducer ~/reducer.py

And you will get the required output

1   
2   
3   
7   
11  
20  
40

NOTE :

  1. I have used a simple one key input. If however you have multiple keys and/or partitions, you will have to edit mapred.text.key.comparator.options as needed. Since I do not know your use case , my example is limited to this

  2. Identity mapper is needed since you will need atleast one mapper for a MR job to run.

  3. Identity reducer is needed since shuffle/sort phase will not work if it is a pure map only job.

like image 79
Nicole Hu Avatar answered Sep 19 '22 15:09

Nicole Hu