 

Hadoop MapReduce - one output file for each input

I'm new to Hadoop and I'm trying to figure out how it works. As an exercise I have to implement something similar to the WordCount example. The task is to read in several files, do the WordCount on each, and write an output file for each input file. Hadoop uses a combiner, shuffles the output of the map phase as input for the reducers, and then writes one output file (I guess one per reducer instance that is running). I was wondering if it is possible to write one output file per input file instead (so keep the words of inputfile1 together and write their counts to outputfile1, and so on). Is it possible to override the Combiner class, or is there another solution for this? (I'm not sure whether this should even be solved as a Hadoop task, but that is the exercise.)

Thanks...

asked Nov 13 '22 by spooky

1 Answer

The map.input.file configuration parameter holds the name of the file the mapper is currently processing. Read this value in the mapper and emit it as the mapper's output key; then all the key/value pairs from a single input file will go to one reducer.

The code in the mapper (note: I am using the old MR API):

private JobConf conf;

@Override
public void configure(JobConf conf) {
    this.conf = conf;
}

@Override
public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // map.input.file holds the path of the input split being processed
    String filename = conf.get("map.input.file");
    output.collect(new Text(filename), value);
}

And use MultipleOutputFormat, which allows the job to write multiple output files. The file names can be derived from the output keys and values.
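A minimal sketch of what that could look like with the old API's MultipleTextOutputFormat (the class name PerFileOutputFormat and the base-name derivation are illustrative assumptions, not from the answer):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Illustrative subclass: derives the output file name from the mapper's
// output key, which the mapper above set to the input file's path.
public class PerFileOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value,
            String name) {
        // Keep only the base name of the input path, so records keyed by
        // ".../inputfile1" end up in an output file named "inputfile1"
        return new Path(key.toString()).getName();
    }
}
```

In the driver you would then register it with something like conf.setOutputFormat(PerFileOutputFormat.class). This is a job-configuration sketch and needs a Hadoop installation to actually run.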

answered Nov 16 '22 by Praveen Sripati