I'm a newbie in Hadoop. I'm trying out the Wordcount program.
Now to try out multiple output files, i use MultipleOutputFormat
. this link helped me in doing it. http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
in my driver class i had
MultipleOutputs.addNamedOutput(conf, "even",
org.apache.hadoop.mapred.TextOutputFormat.class, Text.class,
IntWritable.class);
MultipleOutputs.addNamedOutput(conf, "odd",
org.apache.hadoop.mapred.TextOutputFormat.class, Text.class,
IntWritable.class);`
and my reduce class became this
public static class Reduce extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
MultipleOutputs mos = null;
public void configure(JobConf job) {
mos = new MultipleOutputs(job);
}
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
if (sum % 2 == 0) {
mos.getCollector("even", reporter).collect(key, new IntWritable(sum));
}else {
mos.getCollector("odd", reporter).collect(key, new IntWritable(sum));
}
//output.collect(key, new IntWritable(sum));
}
@Override
public void close() throws IOException {
// TODO Auto-generated method stub
mos.close();
}
}
Things worked , but i get LOT of files, (one odd and one even for every map-reduce)
Question is : How can i have just 2 output files (odd & even) so that every odd output of every map-reduce gets written into that odd file, and same for even.
MultipleOutputs MultipleOutputs class provide facility to write Hadoop map/reducer output to more than one folders. Basically, we can use MultipleOutputs when we want to write outputs other than map reduce job default output and write map reduce job output to different files provided by a user.
Can we configure mappers to write output on HDFS ? The output of Mapper is not written on HDFS because, the Block of data are replicated in the datanode based on the replication factor and namenode should hold the metadata of blocks.
Each reducer uses an OutputFormat to write records to. So that's why you are getting a set of odd and even files per reducer. This is by design so that each reducer can perform writes in parallel.
If you want just a single odd and single even file, you'll need to set mapred.reduce.tasks to 1. But performance will suffer, because all the mappers will be feeding into a single reducer.
Another option is to change the process the reads these files to accept multiple input files, or write a separate process that merges these files together.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With