
What is the purpose of the org.apache.hadoop.mapreduce.Mapper.run() function in Hadoop?

The setup() method is called before the first call to map(), and cleanup() is called after the last one. The documentation for run() says

Expert users can override this method for more complete control over the execution of the Mapper.

I am looking for the practical purpose of this function.

Asked Sep 18 '11 by Praveen Sripati

People also ask

What is the function of Hadoop MapReduce?

MapReduce is a Hadoop framework for writing applications that process vast amounts of data on large clusters. It can also be described as a programming model for processing large datasets across clusters of computers, with the data stored in distributed form.

What is the role of mapper in Hadoop?

A Hadoop mapper is a function or task that processes the input records from a file and generates output that serves as input to the reducer. It produces that output as new key-value pairs.
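As a concrete illustration (not from the question), the core of the classic word-count map step can be sketched in plain Java, ignoring the Hadoop API types: every token in an input line is emitted as a (word, 1) key-value pair.

```java
import java.util.ArrayList;
import java.util.List;

public class WordCountMapSketch {
    // Emulates the map step: emit one (word, 1) pair per token in the line.
    // Pairs are represented here as "word<TAB>1" strings for simplicity.
    static List<String> map(String line) {
        List<String> pairs = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(token + "\t1");  // this is what the reducer would receive
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(map("the quick brown the"));
    }
}
```

In a real job these pairs would be written via context.write() and grouped by key before reaching the reducer, which would sum the 1s per word.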

What is the role of mapper and reducer in Hadoop platform?

The mapper processes the input data and produces several small chunks of intermediate data. The reduce stage combines the shuffle step and the reduce step: the reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.

What is the purpose of MapReduce explain it with suitable example?

MapReduce is a programming model for performing distributed, parallel processing on a Hadoop cluster, which is a large part of what makes Hadoop fast. When you are dealing with big data, serial processing is no longer practical. MapReduce is divided into two main phase-wise tasks: the map task and the reduce task.


2 Answers

The default run() method simply takes each key/value pair supplied by the context and passes it to the map() method:

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

If you wanted to do more than that, you'd need to override it. For example, the MultithreadedMapper class overrides run() to feed key/value pairs to a pool of worker threads instead of processing them one at a time.
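As a sketch of what "more complete control" can mean in practice (a hypothetical example, not from the answer), a run() override can wrap the default loop with extra behavior, such as logging progress every N records. The structure mirrors the default implementation shown above; this is not runnable on its own since it needs the Hadoop Mapper class around it.

```java
// Hypothetical run() override inside a Mapper subclass: identical to the
// default loop, but logs progress every 10,000 input records.
@Override
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        long count = 0;
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
            if (++count % 10_000 == 0) {
                System.err.println("processed " + count + " records");
            }
        }
    } finally {
        // try/finally ensures cleanup() runs even if map() throws
        cleanup(context);
    }
}
```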

Answered Sep 23 '22 by Brian Roach

I just came up with a fairly odd case for using this.

Occasionally I've found that I want a mapper that consumes all its input before producing any output. In the past I've done this by performing the record writes in my cleanup() function. My map() function doesn't actually output any records; it just reads the input and stores whatever will be needed in private structures.

It turns out that this approach works fine unless you're producing a LOT of output. As best I can make out, the mapper's spill facility doesn't operate during cleanup(), so the records that are produced just keep accumulating in memory, and if there are too many of them you risk heap exhaustion. That's my speculation about what's going on, and it could be wrong, but the problem definitely goes away with my new approach.

That new approach is to override run() instead of cleanup(). My only change to the default run() is that after the last record has been delivered to map(), I call map() once more with null key and value. That's a signal to my map() function to go ahead and produce its output. In this case, with the spill facility still operable, memory usage stays in check.
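A sketch of that run() override (my reconstruction of the approach described, not the author's actual code): the only change from the default implementation is the extra map() call with null key and value after the input is exhausted. This assumes the subclass's map() checks for the null sentinel; it is a fragment, not standalone code.

```java
// Hypothetical run() override inside a Mapper subclass implementing the
// "consume everything, then emit" pattern described above.
@Override
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        // map() only accumulates state in private fields; it writes nothing.
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    // Sentinel call: a null key/value tells map() the input is exhausted,
    // so it can emit all accumulated records while spilling still works.
    map(null, null, context);
    cleanup(context);
}
```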

Answered Sep 26 '22 by Andy Lowry