Hadoop MapReduce Streaming sorting on multiple columns

Question

I have mapreduce input that looks like this:

key1 	 4.1 	 more ...
key1 	 10.3 	 more ...
key2 	 6.9 	 more ...
key2 	 3 	 more ...

I want to sort by the first column, then by the second column (reverse numerical). Is there a way to achieve this with Streaming MapReduce?

My current attempt is this:

hadoop jar hadoop-streaming-1.2.1.jar \
    -Dnum.key.fields.for.partition=1 \
    -Dmapred.text.key.comparator.options='-k1,2rn' \
    -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -mapper cat \
    -reducer cat \
    -file mr_base.py \
    -file common.py \
    -file mr_sort_combiner.py \
    -input mr_combiner/2013_12_09__05_47_21/part-* \
    -output mr_sort_combiner/2013_12_09__07_15_59/

But this sorts by the first part of the key and then by the second, and it compares the second part as a string rather than as a number.

Any ideas on how I can sort two fields (one numeric and one textual)?

blanche · Accepted Answer

You can sort numerically on multiple columns by specifying multiple -k options in mapred.text.key.comparator.options, much as you would with the Linux sort command.

For example, in bash:

sort -k1,1 -k2rn
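
To see what those two keys do before involving Hadoop, here is a minimal local sketch using the four sample rows from the question. It assumes tab-separated input and a placeholder third column ("more"); I also end the second key at column 2 (-k2,2rn) so that only that column is compared.

```shell
# Sample rows from the question, tab-separated (third column is a placeholder).
input=$(printf 'key1\t4.1\tmore\nkey1\t10.3\tmore\nkey2\t6.9\tmore\nkey2\t3\tmore')

# -t       : use a tab as the field separator.
# -k1,1    : first key is column 1 only (text, ascending).
# -k2,2rn  : second key is column 2 only, numeric (n) and reversed (r).
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2rn)

printf '%s\n' "$sorted"
# key1	10.3	more
# key1	4.1	more
# key2	6.9	more
# key2	3	more
```

The point is that each -k option defines its own key with its own modifiers, so the r and n flags apply to the second column alone.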

So for your example, replace the single -k1,2rn key (one key spanning columns 1 through 2) with two separate keys:

hadoop jar hadoop-streaming-1.2.1.jar \
    -Dmapred.text.key.comparator.options='-k1,1 -k2rn' \
    -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -mapper cat \
    -reducer cat \
    -file mr_base.py \
    -file common.py \
    -file mr_sort_combiner.py \
    -input mr_combiner/2013_12_09__05_47_21/part-* \
    -output mr_sort_combiner/2013_12_09__07_15_59/
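
If you want to confirm that a part file came out in the intended order, a rough local check is sketched below. The is_sorted helper is a hypothetical name of mine, not part of Hadoop, and it assumes tab-separated rows; it relies on sort -c, which only checks the order and reports through the exit status.

```shell
# Hypothetical helper: succeeds when tab-separated stdin is already ordered
# by column 1 (text, ascending) and then column 2 (numeric, descending).
is_sorted() {
    # -c checks the order without sorting; the exit status carries the result.
    # -s (a GNU/BSD extension) keeps rows with equal keys from being
    #    compared on the whole line, so ties never count as disorder.
    LC_ALL=C sort -c -s -t "$(printf '\t')" -k1,1 -k2,2rn 2>/dev/null
}

# Rows in the order the job is expected to emit them.
good=$(printf 'key1\t10.3\tmore\nkey1\t4.1\tmore\nkey2\t6.9\tmore\nkey2\t3\tmore')
# Same rows with column 2 in ascending string order instead.
bad=$(printf 'key1\t10.3\tmore\nkey1\t4.1\tmore\nkey2\t3\tmore\nkey2\t6.9\tmore')

if printf '%s\n' "$good" | is_sorted; then good_status=sorted; else good_status=unsorted; fi
if printf '%s\n' "$bad" | is_sorted; then bad_status=sorted; else bad_status=unsorted; fi

echo "good: $good_status, bad: $bad_status"
# good: sorted, bad: unsorted
```

Against real output you could pipe a part file into it, for example hadoop fs -cat mr_sort_combiner/2013_12_09__07_15_59/part-00000 | is_sorted (part-00000 is an assumed file name). Each reducer's part file is sorted on its own, so check them one at a time. If the second column still comes out unordered, one thing worth checking (my assumption, not something tested here) is whether the streaming key actually includes both columns; the Hadoop streaming documentation describes a stream.num.map.output.key.fields property for that.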

Tags: sorting, hadoop

Sami A. Haija
