
MultipleOutputFormat in hadoop

I'm a newbie in Hadoop, and I'm trying out the WordCount program.

To write multiple output files, I used MultipleOutputs. This link helped me do it: http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html

In my driver class I had:

    MultipleOutputs.addNamedOutput(conf, "even",
            org.apache.hadoop.mapred.TextOutputFormat.class, Text.class,
            IntWritable.class);

    MultipleOutputs.addNamedOutput(conf, "odd",
            org.apache.hadoop.mapred.TextOutputFormat.class, Text.class,
            IntWritable.class);

and my reduce class became this:

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        // Gives access to the "even" and "odd" named outputs registered in the driver.
        MultipleOutputs mos = null;

        public void configure(JobConf job) {
            mos = new MultipleOutputs(job);
        }

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Sum the counts for this word.
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            // Route the total to the named output that matches its parity.
            if (sum % 2 == 0) {
                mos.getCollector("even", reporter).collect(key, new IntWritable(sum));
            } else {
                mos.getCollector("odd", reporter).collect(key, new IntWritable(sum));
            }
            // The default output is no longer used:
            //output.collect(key, new IntWritable(sum));
        }

        @Override
        public void close() throws IOException {
            // Flush and close the named outputs.
            mos.close();
        }
    }

This worked, but I get a LOT of files (one odd and one even for every map-reduce).

The question is: how can I have just 2 output files (odd and even), so that every odd output of every map-reduce gets written into the odd file, and the same for even?

asked Aug 16 '10 06:08 by raj



1 Answer

Each reducer writes its records through its own OutputFormat, which is why you are getting one set of odd and even files per reducer. This is by design, so that the reducers can write in parallel.

If you want just a single odd file and a single even file, you'll need to set mapred.reduce.tasks to 1. Performance will suffer, though, because all the mappers will be feeding into a single reducer.
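To see why the reducer count drives the file count, here is a rough plain-Java simulation with no Hadoop classes involved. The `ReducerFileSim` class, its method names, and the `even-r-00000` style file names are assumptions made for illustration; the partition formula mirrors what Hadoop's default hash partitioner does. In a real driver using the old API, the single-reducer setting would be `conf.setNumReduceTasks(1)` on the `JobConf`.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical simulation: which named-output files would exist
// for a given number of reduce tasks. Not part of Hadoop.
class ReducerFileSim {

    // Same idea as the default hash partitioner: a key always lands on one reducer.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Collects the file names that the odd/even routing would create.
    // The "<name>-r-<index>" pattern is an assumed naming scheme.
    static Set<String> outputFiles(Map<String, Integer> counts, int numReduceTasks) {
        Set<String> files = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int reducer = partition(e.getKey(), numReduceTasks);
            String name = (e.getValue() % 2 == 0) ? "even" : "odd";
            files.add(String.format("%s-r-%05d", name, reducer));
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("a", 1, "b", 2, "c", 3, "d", 4);
        // Several reducers: each one may create its own odd and even file.
        System.out.println(outputFiles(counts, 3));
        // One reducer: prints [even-r-00000, odd-r-00000]
        System.out.println(outputFiles(counts, 1));
    }
}
```

With three reducers the same four words can end up in as many as six files; with one reducer there are exactly two.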

Another option is to change the process that reads these files to accept multiple input files, or to write a separate process that merges these files together.
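As a minimal sketch of the merge approach, assuming the job output has already been copied to a local directory and that the part files start with `odd-` and `even-` (both are assumptions; the `PartFileMerger` class and the `odd.txt` / `even.txt` target names are made up for the example), something like this would concatenate them:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: merges per-reducer part files on the local filesystem.
class PartFileMerger {

    // Concatenates every file in dir whose name starts with prefix into target,
    // in file-name order. The target name must not start with the prefix.
    static void merge(Path dir, String prefix, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, prefix + "*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // directory listing order is not guaranteed
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args[0]); // local copy of the job output directory
        merge(dir, "odd-", dir.resolve("odd.txt"));
        merge(dir, "even-", dir.resolve("even.txt"));
    }
}
```

This keeps the job itself parallel and pays the cost of a single writer only in the post-processing step.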

answered Oct 06 '22 01:10 by bajafresh4life