I'm working on something similar to the canonical MapReduce example - word count - but with a twist: I only want the Top N results.
Let's say I have a very large set of text data in HDFS. There are plenty of examples that show how to build a Hadoop MapReduce job that will provide you with a word count for every word in that text. For example, if my corpus is:
"This is a test of test data and a good one to test this"
The result set from the standard MapReduce word count job would be:
test:3, a:2, this:2, is:1, etc.
But what if I ONLY want to get the Top 3 words that were used in my entire set of data?
I could still run the exact same standard MapReduce word-count job and simply take the Top 3 results once it has finished emitting the count for EVERY word, but that seems inefficient, because a lot of data needs to be moved around during the shuffle phase.
What I'm thinking is that, if the sample is large enough and the data is well randomized and well distributed in HDFS, each Mapper does not need to send ALL of its word counts to the Reducers, but only some of its top counts. So if one mapper has this:
a:8234, the:5422, man:4352, ... many more words ..., rareword:1, weirdword:1, etc.
Then what I'd like to do is only send the Top 100 or so words from each Mapper to the Reducer phase - since there is very little chance that "rareword" will suddenly end up in the Top 3 when all is said and done. This seems like it would save on bandwidth and also on Reducer processing time.
Can this be done in the Combiner phase? Is this sort of optimization prior to the shuffle phase commonly done?
Approach Used: a TreeMap. The idea is to have each Mapper find its local top 10 records, since many Mappers can run in parallel on different blocks of the file, and then to aggregate all of these local top 10 lists at the Reducer, which finds the global top 10 records for the file.
The text from the input file is tokenized into words to form key-value pairs, with each word as the key and '1' as the value. This is how the standard MapReduce word count program executes and outputs the number of occurrences of every word in a given input file.
The MapReduce framework splits the input data into chunks, sorts the map outputs, and feeds them to the reduce tasks; a file system stores the input and output of the job, and the framework handles scheduling, monitoring, and re-executing failed tasks.
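A rough sketch of that TreeMap idea is below (a sketch only - the class and field names are placeholders, and it assumes each input line is already a word/count pair, e.g. the output of a plain word-count job, with a single reducer collecting the local lists):

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that keeps only its local top 10 (word, count) records.
public class TopTenMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

  // TreeMap sorted by count (ascending), so the smallest entry is easy to drop;
  // note that equal counts overwrite each other in this simple sketch.
  private final TreeMap<Long, String> localTop = new TreeMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // assumes the line looks like "word<TAB>count"
    String[] parts = value.toString().split("\t");
    if (parts.length == 2) {
      localTop.put(Long.parseLong(parts[1]), parts[0]);
      if (localTop.size() > 10) {
        localTop.remove(localTop.firstKey()); // drop the smallest count
      }
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // emit only the local top 10, so each mapper contributes at most 10 records
    for (Map.Entry<Long, String> entry : localTop.entrySet()) {
      context.write(NullWritable.get(), new Text(entry.getValue() + "\t" + entry.getKey()));
    }
  }
}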
This is a very good question, because you have hit the inefficiency of Hadoop's word count example.
The tricks to optimize your problem are the following:
Do a HashMap-based grouping in your local map stage; you can also use a combiner for that. It can look like the code below - I'm using Guava's HashMultiset, which facilitates a nice counting mechanism.
public static class WordFrequencyMapper extends
    Mapper<LongWritable, Text, Text, LongWritable> {

  // local, in-memory counts for this mapper's input split (Guava HashMultiset)
  private final HashMultiset<String> wordCountSet = HashMultiset.create();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // count tokens locally instead of emitting (word, 1) for every occurrence
    String[] tokens = value.toString().split("\\s+");
    for (String token : tokens) {
      wordCountSet.add(token);
    }
  }
And you emit the result in your cleanup stage:
  @Override
  protected void cleanup(Context context) throws IOException,
      InterruptedException {
    // emit each distinct word once, with its locally aggregated count
    Text key = new Text();
    LongWritable value = new LongWritable();
    for (Entry<String> entry : wordCountSet.entrySet()) {
      key.set(entry.getElement());
      value.set(entry.getCount());
      context.write(key, value);
    }
  }
}
So you have grouped the words within a local block of work, reducing network usage by spending a bit of RAM. You can also do the same with a Combiner, but it sorts to group - so this would be slower (especially for strings!) than using a HashMultiset.
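For comparison, the Combiner route mentioned above would just be the ordinary summing reducer registered as a combiner; a minimal sketch (the class name here is mine, not from this answer):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Plain summing combiner - it relies on Hadoop's sort-based grouping of the map output.
public class SumCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    context.write(key, new LongWritable(sum));
  }
}

// In the driver: job.setCombinerClass(SumCombiner.class);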
To get just the Top N, you only have to write the Top N from that local HashMultiset to the output collector and aggregate the results in the normal way on the reduce side. This saves you a lot of network bandwidth as well; the only drawback is that you need to sort the word-count tuples in your cleanup method.
A part of the code might look like this:
Set<String> elementSet = wordCountSet.elementSet();
String[] array = elementSet.toArray(new String[elementSet.size()]);
Arrays.sort(array, new Comparator<String>() {
  @Override
  public int compare(String o1, String o2) {
    // sort descending by count
    return Long.compare(wordCountSet.count(o2), wordCountSet.count(o1));
  }
});

Text key = new Text();
LongWritable value = new LongWritable();
// just emit the first N records (or fewer, if this mapper saw fewer distinct words)
for (int i = 0; i < Math.min(N, array.length); i++) {
  key.set(array[i]);
  value.set(wordCountSet.count(array[i]));
  context.write(key, value);
}
Hope you get the gist: do as much of the work locally as possible and then just aggregate the Top N of the Top N's ;)
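For completeness, the reduce side could look roughly like this - only a sketch, assuming a single reducer (job.setNumReduceTasks(1)) so that one reducer sees every mapper's local Top N; the class name and the constant N are mine:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

  private static final int N = 3; // assumed Top N, adjust as needed
  private final Map<String, Long> globalCounts = new HashMap<>();

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    // sum the partial counts coming from the mappers' local Top N lists
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    globalCounts.merge(key.toString(), sum, Long::sum);
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // sort descending by count and emit only the global Top N
    List<Map.Entry<String, Long>> entries = new ArrayList<>(globalCounts.entrySet());
    entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
    for (int i = 0; i < Math.min(N, entries.size()); i++) {
      context.write(new Text(entries.get(i).getKey()), new LongWritable(entries.get(i).getValue()));
    }
  }
}

As the question already notes, this is an approximation: a word that just misses every mapper's local cutoff could in principle belong in the global Top N, so the local cutoff should be comfortably larger than the final N (e.g. keep the Top 100 locally for a global Top 3).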