
Sorting by value in Hadoop from a file

I have a file in which every line contains a string, then a space, then a number.

Example:

Line1: Word 2
Line2: Word1 8
Line3: Word2 1

I need to sort the lines by the number in descending order and write the result to a file, assigning a rank to each number. So my output should be a file in the following format:

Line1: Word1 8 1
Line2: Word 2 2
Line3: Word2 1 3

Does anyone have an idea how I can do this in Hadoop? I am using Java with Hadoop.

Deepika Sethi asked Nov 27 '11 22:11


2 Answers

You could organize your map/reduce computation like this:

Map input: default

Map output: "key: number, value: word"

(shuffle/sort phase: pairs are sorted by key)

Here you will need to override the default sort comparator so that keys are sorted in decreasing order.

Reduce - 1 reducer

Reduce input: "key: number, value: word"

Reduce output: "key: word, value: (number, rank)"

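Keep a counter in the reducer. For each key-value pair, increment the counter and emit its current value as the rank. Because there is only one reducer and it receives the keys in sorted order, the counter value is the global rank.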
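The outline above can be sketched in plain Java, without any Hadoop classes, to check the logic before writing the real mapper and reducer. This is only a simulation under stated assumptions: the class and method names (RankByValueSketch, map, rank) are hypothetical, the in-memory sort stands in for the shuffle/sort phase with a descending comparator, and the loop stands in for the single reducer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the job outline; hypothetical names, no Hadoop API.
class RankByValueSketch {

    // "Map" step: turn a "word number" line into (key = number, value = word).
    static Map.Entry<Integer, String> map(String line) {
        String[] parts = line.trim().split("\\s+");
        return new SimpleEntry<>(Integer.parseInt(parts[1]), parts[0]);
    }

    // "Sort" and "reduce" steps, as a single reducer would see them.
    static List<String> rank(List<String> lines) {
        List<Map.Entry<Integer, String>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        // Stand-in for the shuffle/sort phase with a descending key comparator.
        pairs.sort(Map.Entry.<Integer, String>comparingByKey(Comparator.reverseOrder()));

        List<String> out = new ArrayList<>();
        int rank = 0; // the reducer's running counter
        for (Map.Entry<Integer, String> pair : pairs) {
            rank++;
            out.add(pair.getValue() + " " + pair.getKey() + " " + rank);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String row : rank(Arrays.asList("Word 2", "Word1 8", "Word2 1"))) {
            System.out.println(row);
        }
    }
}
```

Note that in this sketch equal numbers receive distinct consecutive ranks; if ties should share a rank, the counter would have to advance once per distinct key instead of once per pair.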

Edit: Here is a code snippet of a custom descending comparator:

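// Imports needed at the top of the enclosing file:
import java.nio.ByteBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

// Nested inside the job class; the map output key must be an IntWritable.
public static class IntComparator extends WritableComparator {

    public IntComparator() {
        super(IntWritable.class);
    }

    // Compares the serialized keys directly, without deserializing them.
    @Override
    public int compare(byte[] b1, int s1, int l1,
            byte[] b2, int s2, int l2) {

        // An IntWritable is serialized as a 4-byte big-endian int.
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();

        // Swapping the arguments yields descending order.
        return Integer.compare(v2, v1);
    }
}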

Don't forget to actually set it as the comparator for your job:

job.setSortComparatorClass(IntComparator.class);
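
To see why comparing the raw bytes works, here is a small standalone sketch without Hadoop on the classpath. It assumes keys are serialized as 4-byte big-endian ints, which is the layout DataOutput.writeInt produces; the class and method names (RawIntDescendingSketch, serialize, compareDescending, sortDescending) are hypothetical and exist only for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of a raw-byte descending int comparison (no Hadoop API).
class RawIntDescendingSketch {

    // Serialize an int as 4 big-endian bytes, the layout DataOutput.writeInt produces.
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Read both ints from their byte ranges and compare them in reverse order.
    static int compareDescending(byte[] b1, int s1, int l1,
                                 byte[] b2, int s2, int l2) {
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(v2, v1);
    }

    // Sort serialized keys with the raw comparison, then decode them for inspection.
    static List<Integer> sortDescending(int... values) {
        List<byte[]> keys = new ArrayList<>();
        for (int v : values) {
            keys.add(serialize(v));
        }
        keys.sort((a, b) -> compareDescending(a, 0, a.length, b, 0, b.length));

        List<Integer> decoded = new ArrayList<>();
        for (byte[] key : keys) {
            decoded.add(ByteBuffer.wrap(key).getInt());
        }
        return decoded;
    }
}
```

Decoding to int before comparing matters for negative numbers: a plain byte-by-byte comparison would order them incorrectly, whereas reading the value and comparing numerically does not.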

Tudor answered Sep 21 '22 00:09


Hadoop Streaming - Hadoop 1.0.x

According to this, after the command

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.*.jar

  1. you add a comparator

    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator

  2. you specify the kind of sorting you want

    -D mapred.text.key.comparator.options=-[ options]

where the [ options] are similar to those of Unix sort. Here are some examples:

Reverse order

-D mapred.text.key.comparator.options=-r

Sort on numeric values

-D mapred.text.key.comparator.options=-n

Sort on value or whatever field

-D mapred.text.key.comparator.options=-kx,y

With the -k flag you specify the sort key. The x and y parameters define this key: the fields from x through y. So, if a line has more than one token, you can choose which token, or which combination of tokens, is used as the sort key. See the references for more details and examples.
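
To make the meaning of these options concrete, here is a plain-Java emulation of sorting on one field numerically and in reverse, which is what a combination such as -k2,2nr expresses in Unix sort terms. It is not Hadoop's KeyFieldBasedComparator: the class and method names (KeyFieldSortSketch, byNumericField, sort) and the separator argument are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Emulates "sort on field N, numerically, optionally reversed" (hypothetical names).
class KeyFieldSortSketch {

    // field is 1-based, like x in -kx,y when x == y; reverse plays the role of -r.
    static Comparator<String> byNumericField(int field, String separatorRegex,
                                             boolean reverse) {
        Comparator<String> cmp = Comparator.comparingLong(
                (String line) -> Long.parseLong(line.split(separatorRegex)[field - 1].trim()));
        return reverse ? cmp.reversed() : cmp;
    }

    // Returns a sorted copy; the input list is left untouched.
    static List<String> sort(List<String> lines, int field, boolean reverse) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(byNumericField(field, "\\s+", reverse));
        return sorted;
    }
}
```

Calling sort(lines, 2, true) on the question's three lines orders them by the second token from largest to smallest, which is the ordering the question asks for before ranks are assigned.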

vpap answered Sep 20 '22 00:09