The data looks like this, first field is a number,
3 ...
1 ...
2 ...
11 ...
And I want to sort these lines according to the first field numerically instead of alphabetically, which means after sorting it should look like this,
1 ...
2 ...
3 ...
11 ...
But hadoop keeps giving me this,
1 ...
11 ...
2 ...
3 ...
How do correct it?
Shuffle phase in Hadoop transfers the map output from Mapper to a Reducer in MapReduce. Sort phase in MapReduce covers the merging and sorting of map outputs. Data from the mapper are grouped by the key, split among reducers, and sorted by the key. Every reducer obtains all values associated with the same key.
Definition: A distribution sort algorithm that begins by removing the first 1/8 of the n items, sorting them (recursively), and putting them in an array. This creates n/8 buckets to which the remaining 7/8 of the items are distributed. Each bucket is then sorted, and the buckets are concatenated.
Shuffle and sort phase in Hadoop occur simultaneously and are done by the MapReduce framework.
Shuffling in MapReduceIt is also the process by which the system performs the sort. Then it transfers the map output to the reducer as input. This is the reason shuffle phase is necessary for the reducers. Otherwise, they would not have any input (or input from every mapper).
Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator should be added to streaming command
You need to provide type of sorting required using mapred.text.key.comparator.options. Some useful ones are -n : numeric sort, -r : reverse sort
EXAMPLE :
Create an identity mapper and reducer with the following code
This is the mapper.py & reducer.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
print "%s" % (line.strip())
This is the input.txt
1
11
2
20
7
3
40
This is the Streaming command
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
-D mapred.text.key.comparator.options=-n
-input /user/input.txt
-output /user/output.txt
-file ~/mapper.py
-mapper ~/mapper.py
-file ~/reducer.py
-reducer ~/reducer.py
And you will get the required output
1
2
3
7
11
20
40
NOTE :
I have used a simple one key input. If however you have multiple keys and/or partitions, you will have to edit mapred.text.key.comparator.options as needed. Since I do not know your use case , my example is limited to this
Identity mapper is needed since you will need atleast one mapper for a MR job to run.
Identity reducer is needed since shuffle/sort phase will not work if it is a pure map only job.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With