Hadoop MapReduce Streaming sorting on multiple columns

Tags:

sorting

hadoop

I have mapreduce input that looks like this:

key1 \t 4.1 \t more ...
key1 \t 10.3 \t more ...
key2 \t 6.9 \t more ...
key2 \t 3 \t more ...

I want to sort by the first column, then by the second column (reverse numerical). Is there a way to achieve this with Streaming MapReduce?

My current attempt is this:

hadoop jar hadoop-streaming-1.2.1.jar \
    -Dnum.key.fields.for.partition=1 \
    -Dmapred.text.key.comparator.options='-k1,2rn' \
    -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -mapper cat \
    -reducer cat \
    -file mr_base.py \
    -file common.py \
    -file mr_sort_combiner.py \
    -input mr_combiner/2013_12_09__05_47_21/part-* \
    -output mr_sort_combiner/2013_12_09__07_15_59/

But this sorts by the first part of the key and then by the second, and it compares the second part as a string rather than as a number.

Any ideas on how I can sort on two fields (one textual and one numeric)?

Sami A. Haija asked Sep 04 '25 16:09

1 Answer

You can achieve numerical sorting on multiple columns by specifying multiple -k options in mapred.text.key.comparator.options (similar to the Linux sort command).

For example, in bash:

sort -k1,1 -k2rn
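
To make the intended ordering concrete, here is a minimal Python sketch that applies the same two-level sort to the sample rows from the question. The function name sort_rows is hypothetical, and the sketch only emulates the ordering locally; it is not how Hadoop implements the comparator.

```python
def sort_rows(lines):
    """Sort tab-separated lines by column 1 (text, ascending),
    then by column 2 (numeric, descending)."""
    def sort_key(line):
        fields = line.split("\t")
        # Negating the number gives descending numeric order within a key.
        return (fields[0], -float(fields[1]))
    return sorted(lines, key=sort_key)


# Sample rows modeled on the question's input (real tabs as separators).
rows = [
    "key1\t4.1\tmore",
    "key1\t10.3\tmore",
    "key2\t6.9\tmore",
    "key2\t3\tmore",
]

for row in sort_rows(rows):
    print(row)
```

Within key1, 10.3 comes before 4.1 here because the values are compared as numbers; a plain reverse string comparison would put 4.1 first, which is the behaviour described in the question.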

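So for your example it would be the following (if the second column is not already part of the streaming key, you may also need to add -Dstream.num.map.output.key.fields=2):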

hadoop jar hadoop-streaming-1.2.1.jar \
    -Dmapred.text.key.comparator.options='-k1,1 -k2rn' \
    -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -mapper cat \
    -reducer cat \
    -file mr_base.py \
    -file common.py \
    -file mr_sort_combiner.py \
    -input mr_combiner/2013_12_09__05_47_21/part-* \
    -output mr_sort_combiner/2013_12_09__07_15_59/
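
Once the job has run, one way to confirm the result is to check the ordering of a single output part file. The sketch below is only an illustration: the function name is_sorted_output is hypothetical, and it assumes tab-separated lines with a numeric second column. Note that with more than one reducer, each part file is sorted on its own, so the check applies per file rather than across the whole output directory.

```python
def is_sorted_output(lines):
    """Return True if lines are ordered by column 1 ascending,
    then by column 2 numerically descending."""
    keys = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Same composite key as the desired order: text, then negated number.
        keys.append((fields[0], -float(fields[1])))
    # Every adjacent pair must be in non-decreasing key order.
    return all(a <= b for a, b in zip(keys, keys[1:]))


good = ["key1\t10.3\tmore", "key1\t4.1\tmore", "key2\t6.9\tmore", "key2\t3\tmore"]
bad = ["key1\t4.1\tmore", "key1\t10.3\tmore"]  # string-style order within key1

print(is_sorted_output(good))  # True
print(is_sorted_output(bad))   # False
```

The lines could come from any source, for example a part file copied to the local machine and read with open().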

blanche answered Sep 07 '25 11:09