I have a file containing a String, then a space and then a number on every line.
Example:
Line1: Word 2
Line2 : Word1 8
Line3: Word2 1
I need to sort the number in descending order and then put the result in a file assigning a rank to the numbers. So my output should be a file containing the following format:
Line1: Word1 8 1
Line2: Word 2 2
Line3: Word2 1 3
Does anyone has an idea, how can I do it in Hadoop? I am using java with Hadoop.
So, all you values that needs to be sorted should be the key in your mapreduce job. Hadoop by default sorts by ascending order of key. Hence, either you do this to sort in descending order, job.
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce implements sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys. Sorting methods are implemented in the mapper class itself.
Secondary sorting means sorting to be done on two or more field values of the same or different data types. Additionally we might also have deal with grouping and partitioning. The best and most efficient way to do secondary sorting in Hadoop is by writing our own key class.
You could organize your map/reduce computation like this:
Map input: default
Map output: "key: number, value: word"
_ sorting phase by key _
Here you will need to override the default sorter to sort in decreasing order.
Reduce - 1 reducer
Reduce input: "key: number, value: word"
Reduce output: "key: word, value: (number, rank)"
Keep a global counter. For each key-value pair add the rank by incrementing the counter.
Edit: Here is a code snipped of a custom descendant sorter:
public static class IntComparator extends WritableComparator {
public IntComparator() {
super(IntWritable.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
Integer v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
Integer v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
return v1.compareTo(v2) * (-1);
}
}
Don't forget to actually set it as the comparator for your job:
job.setSortComparatorClass(IntComparator.class);
Hadoop Streaming - Hadoop 1.0.x
According to this, after the
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.*.jar
you add a comparator
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
you specify the kind of sorting you want
-D mapred.text.key.comparator.options=-[ options]
where the [ options] are similar to Unix sort. Here are some examples,
Reverse order
-D mapred.text.key.comparator.options=-r
Sort on numeric values
-D mapred.text.key.comparator.options=-n
Sort on value or whatever field
-D mapred.text.key.comparator.options=-kx,y
with the -k flag you specify the key of sorting. The x, y parameters define this key. So, if you have a line with more than one tokens, you can choose which token of all will be the key of sorting or which combination of tokens will be the key of sorting. See the references for more details and examples.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With