Top N values by Hadoop Map Reduce code

1 Answer

You have two obvious options:

1. Have two MapReduce jobs:

WordCount: counts all the words (pretty much the example exactly)
TopN: A MapReduce job that finds the top N of something (here are some examples: source code, blog post)

Have WordCount write its output to HDFS. Then, have TopN read that output. This is called job chaining, and there are a number of ways to do it: Oozie, bash scripts, firing two jobs from your driver, etc.

The reason you need two jobs is that you are doing two aggregations: the first is the word count, and the second is the top N. Typically in MapReduce each aggregation requires its own MapReduce job.
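
For the bash-script route, a minimal sketch might look like the following. The jar names, class names, and paths are hypothetical placeholders, and the `HADOOP` variable is only there so the command can be swapped out. The one real point is that `&&` keeps TopN from starting unless WordCount succeeded.

```shell
# Hypothetical names throughout: wordcount.jar, topn.jar, WordCount, TopN.
# HADOOP defaults to the real CLI but can be overridden (e.g. for a dry run).
HADOOP=${HADOOP:-hadoop}

run_chain() {
  input=$1    # raw text on HDFS
  counts=$2   # where WordCount writes its output
  top=$3      # where TopN writes its output

  # Job 1 writes to $counts; job 2 reads $counts. The && means job 2
  # only runs if job 1 exited with status 0.
  "$HADOOP" jar wordcount.jar WordCount "$input" "$counts" &&
  "$HADOOP" jar topn.jar TopN "$counts" "$top"
}

# Example call (paths are made up):
# run_chain /input/text /output/of/wordcount /output/of/topn
```

Setting `HADOOP=echo` before calling `run_chain` prints the two commands instead of running them, which is a cheap way to check the wiring.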

2. Run your WordCount job on the data, then use some bash to pull the top N out.

hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n20

sort -n -k2 -r says "sort numerically by column #2, in descending order". head -n20 pulls the top twenty.
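
To see what the pipeline does without a cluster, here is a small sketch on made-up data. It assumes WordCount output lines look like `word<TAB>count`; `sample_counts` and `top_n` are just illustrative helper names standing in for the `hadoop fs -cat` step and the sort/head step.

```shell
# Made-up lines in the "word<TAB>count" shape (stand-in for hadoop fs -cat).
sample_counts() {
  printf 'the\t120\nhadoop\t45\nmap\t300\nreduce\t7\njob\t12\n'
}

# Same sort/head pipeline as above, with N passed in as an argument.
top_n() {
  sort -n -k2 -r | head -n "$1"
}

# Prints the three highest counts, tab-separated:
#   map 300, the 120, hadoop 45
sample_counts | top_n 3
```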

This is the better option for WordCount, simply because WordCount will probably only output on the order of thousands or tens of thousands of lines, and you don't need a MapReduce job to sort that. Remember that just because you have Hadoop around doesn't mean you should solve all your problems with Hadoop.

There is also one non-obvious option, which is tricky but is a mix of both of the above...

Write a WordCount MapReduce job, but in the Reducer do something like in the TopN MapReduce jobs I showed you earlier. Then, have each reducer output only the TopN results from that reducer.

So, if you are doing Top 10, each reducer will output 10 results. Let's say you have 30 reducers, you'll output 300 results.

Then, do the same thing as in option #2 with bash:

hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n10

This should be faster because you are only postprocessing a fraction of the results.

This is the fastest way I can think of doing this, but it's probably not worth the effort.
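
A rough local simulation of that idea, not real reducer code: three temp files stand in for reducer outputs (the words and counts are made up), each is cut down to its own top N, and the small remainder is merged with the same pipeline. It relies on each word landing in exactly one reducer, which is what makes the per-reducer cut safe.

```shell
# Stand-ins for three reducer output files (made-up words and counts).
# Each word appears in exactly one file, as its final count would.
workdir=$(mktemp -d)
printf 'map\t300\njob\t12\ndata\t80\nnode\t5\n'     > "$workdir/part-r-00000"
printf 'the\t120\nreduce\t7\ncluster\t60\n'         > "$workdir/part-r-00001"
printf 'hadoop\t45\nhdfs\t95\nblock\t3\nyarn\t20\n' > "$workdir/part-r-00002"

# What each reducer would emit under this option: only its own top N.
local_top_n() {
  n=$1
  for part in "$workdir"/part-r-*; do
    sort -n -k2 -r "$part" | head -n "$n"
  done
}

# Final merge: the same bash pipeline, but over at most
# (number of reducers * N) lines instead of every word.
global_top_n() {
  local_top_n "$1" | sort -n -k2 -r | head -n "$1"
}

# Prints the overall top three, tab-separated:
#   map 300, the 120, hdfs 95
global_top_n 3
```

With 3 files and N = 3, the merge step only sees 9 lines, and the result matches what sorting all 11 lines would give.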

answered Oct 11 '22 22:10

Donald Miner

Tags:

hadoop

mapreduce

user3078014
