I'm very new to MapReduce and I completed a Hadoop word-count example.
In that example it produces an unsorted file (with key-value pairs) of word counts. Is it possible to sort it by the number of word occurrences by combining another MapReduce task with the earlier one?
The text from the input file is tokenized into words, and each word is emitted as a key-value pair: the key is the word and the value is '1'. This is how the MapReduce word-count program produces the number of occurrences of each word in a given input file.
Sorting is one of the basic MapReduce algorithms used to process and analyze data. The framework automatically sorts the key-value pairs emitted by the mapper by their keys during the shuffle phase, before they reach the reducer.
MapReduce splits the input into smaller chunks called input splits, each representing a block of work for a single map task. The input data is processed in the map phase, where the number of mappers equals the number of input splits.
Example: WordCount v1.0. Before we jump into the details, let's walk through an example MapReduce application to get a flavour of how it works. WordCount is a simple application that counts the number of occurrences of each word in a given input set.
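For reference, a minimal sketch of such a word-count mapper and reducer, written against the same classic org.apache.hadoop.mapred API as the code further down, might look like this (the class names WordCountMap and WordCountReduce are illustrative, not part of the original question):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

class WordCountMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException
    {
        // Emit (word, 1) for every token in the line.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            collector.collect(word, ONE);
        }
    }
}

class WordCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException
    {
        // Sum the 1s emitted for each word.
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        collector.collect(key, new IntWritable(sum));
    }
}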
In the simple word-count MapReduce program, the output is sorted by words. Sample output can be:
Apple 1
Boy 30
Cat 2
Frog 20
Zebra 1
If you want the output to be sorted by the number of occurrences of each word, i.e. in the format below:
1 Apple
1 Zebra
2 Cat
20 Frog
30 Boy
you can create another MR program with the mapper and reducer below, where the input is the output of the simple word-count program.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

class Map1 extends MapReduceBase implements Mapper<Object, Text, IntWritable, Text>
{
    public void map(Object key, Text value, OutputCollector<IntWritable, Text> collector, Reporter reporter) throws IOException
    {
        // Each input line is "<word> <count>", as written by the word-count job.
        String line = value.toString();
        StringTokenizer stringTokenizer = new StringTokenizer(line);

        String word = "empty";
        int number = 999;

        if (stringTokenizer.hasMoreTokens())
        {
            word = stringTokenizer.nextToken().trim();
        }
        if (stringTokenizer.hasMoreTokens())
        {
            number = Integer.parseInt(stringTokenizer.nextToken().trim());
        }

        // Emit (count, word) so the framework sorts records by count during the shuffle.
        collector.collect(new IntWritable(number), new Text(word));
    }
}
class Reduce1 extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text>
{
    public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> collector, Reporter reporter) throws IOException
    {
        // Counts arrive already sorted by the framework; write out every (count, word) pair.
        while (values.hasNext())
        {
            collector.collect(key, values.next());
        }
    }
}
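To wire this up you also need a small driver for the second job. A minimal sketch with the classic JobConf API might look like the following; the class name SortByCount and the command-line arguments are assumptions, with args[0] pointing at the output directory of the word-count job.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SortByCount
{
    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(SortByCount.class);
        conf.setJobName("sort-by-count");

        conf.setMapperClass(Map1.class);
        conf.setReducerClass(Reduce1.class);

        // Both the mapper and the reducer emit (IntWritable count, Text word).
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(Text.class);

        // A single reducer keeps the final output globally sorted by count.
        conf.setNumReduceTasks(1);

        // args[0]: output directory of the word-count job; args[1]: directory for the sorted result.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Because IntWritable keys sort ascending by default, the result matches the format shown above (smallest counts first); a custom comparator would be needed for descending order.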