
Top N values by Hadoop Map Reduce code

I am very new to the Hadoop world and struggling to achieve one simple task.

Can anybody please tell me how to get the top n values for the word count example using only MapReduce code?

I do not want to use any Hadoop command-line tools for this simple task.

asked Dec 14 '13 by user3078014



1 Answer

You have two obvious options:


Option #1: Have two MapReduce jobs:

  1. WordCount: counts all the words (pretty much the standard example exactly)
  2. TopN: a MapReduce job that finds the top N of something (here are some examples: source code, blog post); a minimal sketch follows this list
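For the TopN piece, here is a minimal sketch of what such a job could look like, in the spirit of the linked examples. The class names, the tab-separated input format, and N = 10 are my assumptions, not from the original post; also note that words with identical counts overwrite each other in this simplified version. Each mapper keeps a local top N in a TreeMap keyed by count, and a single reducer merges the local lists into the global top N.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TopN {
  private static final int N = 10; // how many results to keep (illustrative choice)

  // Reads "word<TAB>count" lines from the WordCount output; each mapper keeps a
  // local top N in a TreeMap (sorted by count) and emits it only in cleanup().
  public static class TopNMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private final TreeMap<Integer, String> local = new TreeMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
      String[] parts = value.toString().split("\t");
      local.put(Integer.parseInt(parts[1]), parts[0]); // ties on count overwrite each other
      if (local.size() > N) {
        local.remove(local.firstKey()); // evict the smallest count
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      // Emit this mapper's local top N under a single key so one reducer sees them all.
      for (Map.Entry<Integer, String> e : local.entrySet()) {
        context.write(NullWritable.get(), new Text(e.getValue() + "\t" + e.getKey()));
      }
    }
  }

  // Run with a single reduce task, so one reducer merges every mapper's local
  // top N into the global top N.
  public static class TopNReducer extends Reducer<NullWritable, Text, Text, IntWritable> {
    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      TreeMap<Integer, String> global = new TreeMap<>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t");
        global.put(Integer.parseInt(parts[1]), parts[0]);
        if (global.size() > N) {
          global.remove(global.firstKey());
        }
      }
      // Emit in descending order of count, biggest first.
      for (Map.Entry<Integer, String> e : global.descendingMap().entrySet()) {
        context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
      }
    }
  }
}

The TreeMap trick works because a TreeMap stays sorted by key, so evicting firstKey() after each insert bounds the map to the N largest counts seen so far.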

Have WordCount write its output to HDFS, then have TopN read that output as its input. This is called job chaining, and there are a number of ways to set it up: Oozie, bash scripts, firing the two jobs from your driver, etc.
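If you fire the two jobs from your driver, a sketch could look like the following. The intermediate path, the argument layout, and the reuse of the stock WordCount example classes (from hadoop-mapreduce-examples) are my assumptions, and TopN refers to the sketch above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path intermediate = new Path("/tmp/wordcount-out"); // hypothetical intermediate path

    // Job 1: the stock WordCount example, writing to the intermediate HDFS path.
    Job wordCount = Job.getInstance(conf, "word count");
    wordCount.setJarByClass(ChainDriver.class);
    wordCount.setMapperClass(WordCount.TokenizerMapper.class);
    wordCount.setCombinerClass(WordCount.IntSumReducer.class);
    wordCount.setReducerClass(WordCount.IntSumReducer.class);
    wordCount.setOutputKeyClass(Text.class);
    wordCount.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(wordCount, new Path(args[0]));
    FileOutputFormat.setOutputPath(wordCount, intermediate);
    if (!wordCount.waitForCompletion(true)) System.exit(1); // only chain on success

    // Job 2: TopN over the intermediate output, using the classes sketched above.
    Job topN = Job.getInstance(conf, "top n");
    topN.setJarByClass(ChainDriver.class);
    topN.setMapperClass(TopN.TopNMapper.class);
    topN.setReducerClass(TopN.TopNReducer.class);
    topN.setNumReduceTasks(1); // one reducer must see every local top-N list
    topN.setMapOutputKeyClass(NullWritable.class);
    topN.setMapOutputValueClass(Text.class);
    topN.setOutputKeyClass(Text.class);
    topN.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(topN, intermediate);
    FileOutputFormat.setOutputPath(topN, new Path(args[1]));
    System.exit(topN.waitForCompletion(true) ? 0 : 1);
  }
}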

The reason you need two jobs is that you are doing two aggregations: the first is the word count, and the second is the top N. Typically in MapReduce, each aggregation requires its own MapReduce job.


Option #2: First, have your WordCount job run on the data. Then use a bit of bash to pull the top N out.

hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n20

sort -n -k2 -r says "sort numerically by column #2, in descending order". head -n20 pulls the top twenty.

This is the better option for WordCount, simply because WordCount will probably only output on the order of thousands or tens of thousands of lines, and you don't need a MapReduce job for that. Remember that just because you have Hadoop around doesn't mean you should solve all of your problems with Hadoop.


Option #3: A non-obvious version, which is tricky, but a mix of both of the above...

Write a WordCount MapReduce job, but in the reducer do something like what the TopN MapReduce jobs I showed you earlier do, and then have each reducer output only its top N results.

So, if you are doing a top 10, each reducer will output 10 results. If you have 30 reducers, you'll output 300 results.
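A sketch of what that reducer might look like (again, N = 10 and the class name are illustrative, and ties on count overwrite each other): it sums counts exactly like the normal WordCount reducer, but buffers results in a TreeMap and emits only its local top N in cleanup():

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// WordCount reducer that emits only its own top N words instead of every word.
public class TopNWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private static final int N = 10;
  private final TreeMap<Integer, String> topN = new TreeMap<>();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    // Instead of writing (word, sum) immediately, remember it if it's big enough.
    topN.put(sum, key.toString()); // ties on count overwrite each other in this sketch
    if (topN.size() > N) {
      topN.remove(topN.firstKey()); // evict the smallest count
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Each reducer writes at most N lines; with R reducers, at most N*R lines total.
    for (Map.Entry<Integer, String> e : topN.descendingMap().entrySet()) {
      context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
    }
  }
}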

Then, do the same thing as in option #2 with bash:

hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n10

This should be faster because you are only postprocessing a fraction of the results.

This is the fastest way I can think of doing this, but it's probably not worth the effort.

answered Oct 11 '22 by Donald Miner