I have MapReduce input that looks like this:
key1 \t 4.1 \t more ...
key1 \t 10.3 \t more ...
key2 \t 6.9 \t more ...
key2 \t 3 \t more ...
I want to sort by the first column, then by the second column (reverse numerical). Is there a way to achieve this with Streaming MapReduce?
My current attempt is this:
hadoop jar hadoop-streaming-1.2.1.jar \
-Dnum.key.fields.for.partition=1 \
-Dmapred.text.key.comparator.options='-k1,2rn' \
-Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-mapper cat \
-reducer cat \
-file mr_base.py \
-file common.py \
-file mr_sort_combiner.py \
-input mr_combiner/2013_12_09__05_47_21/part-* \
-output mr_sort_combiner/2013_12_09__07_15_59/
But this sorts by the first and second parts of the key, and it compares the second part as a string rather than as a number.
Any ideas on how I can sort two fields (one numeric and one textual)?
You can achieve numerical sorting on multiple columns by specifying multiple -k options in mapred.text.key.comparator.options, similar to the Linux sort command.
For example, in bash:
sort -k1,1 -k2rn
So for your example it would be:
hadoop jar hadoop-streaming-1.2.1.jar \
-Dmapred.text.key.comparator.options='-k1,1 -k2rn' \
-Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-mapper cat \
-reducer cat \
-file mr_base.py \
-file common.py \
-file mr_sort_combiner.py \
-input mr_combiner/2013_12_09__05_47_21/part-* \
-output mr_sort_combiner/2013_12_09__07_15_59/
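To check the job output against the intended order, a small local script can reproduce the same ordering on a sample of the data. This is only a sketch that mimics the comparator in plain Python, not Hadoop code: the function name sort_records and the cleaned-up sample lines are made up for illustration, and it assumes every line has a numeric second column.

```python
# Local sanity check for the intended order: column 1 ascending (text),
# column 2 descending (numeric). This mimics '-k1,1 -k2rn'; it does not call Hadoop.

def sort_records(lines):
    """Sort tab-separated lines by column 1, then by column 2 numerically, descending."""
    def sort_key(line):
        fields = line.split("\t")
        # Negating the number gives a descending order within each key.
        return (fields[0], -float(fields[1]))
    return sorted(lines, key=sort_key)

# Hypothetical sample mirroring the input from the question.
sample = [
    "key1\t4.1\tmore",
    "key1\t10.3\tmore",
    "key2\t6.9\tmore",
    "key2\t3\tmore",
]

for line in sort_records(sample):
    print(line)
```

Within key1, 10.3 should come before 4.1; a string comparison would put 4.1 first, which is the symptom described in the question. Note also that the comparator only sees the fields that Streaming treats as the key, so if the second column still sorts incorrectly, it may be worth checking whether the job needs stream.num.map.output.key.fields=2 (an assumption here, since the original command does not set it).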