Standard practices for logging in MapReduce jobs

I'm trying to find the best approach for logging in MapReduce jobs. I'm using slf4j with a log4j appender, as in my other Java applications, but since a MapReduce job runs in a distributed manner across the cluster, I don't know where to set the log file location, as it is a shared cluster with limited access privileges.

Are there any standard practices for logging in MapReduce jobs, so that you can easily look at the logs across the cluster after the job completes?

Frank asked Jan 23 '15




1 Answer

You could use log4j, which is the default logging framework that Hadoop uses. From your MapReduce application you can do something like this:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

import java.io.IOException;

public class SampleMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Logger logger = Logger.getLogger(SampleMapper.class);

    @Override
    protected void setup(Context context) {
        logger.info("Initializing NoSQL Connection.");
        try {
            // logic for connecting to NoSQL - omitted
        } catch (Exception ex) {
            // log the message along with the stack trace
            logger.error(ex.getMessage(), ex);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // mapper code omitted
    }
}

This sample code uses a log4j logger to log events from within the Mapper. All the log events are written to their respective task logs, which you can browse from the JobTracker (MRv1) or ResourceManager (MRv2) web page.
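Since you mentioned slf4j: the same approach works unchanged through the slf4j API, because Hadoop distributions typically ship log4j (and the slf4j-log4j12 binding) on the task classpath, so slf4j output lands in the same per-task logs and you never configure a file location yourself. A minimal sketch, assuming that binding is present (the class name here is illustrative):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;

public class Slf4jSampleMapper extends Mapper<LongWritable, Text, Text, Text> {
    // slf4j facade; on the cluster this binds to Hadoop's bundled log4j
    private static final Logger LOG = LoggerFactory.getLogger(Slf4jSampleMapper.class);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        LOG.info("Processing record at offset {}", key.get());
        // mapper logic omitted
    }
}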

If you are using YARN, you can access the application logs from the command line using the following command (this requires log aggregation to be enabled, i.e. yarn.log-aggregation-enable set to true):

yarn logs -applicationId <application_id>

If you are using MapReduce v1, on the other hand, there is no single point of access from the command line; you have to log into each TaskTracker node and look under the configured path, generally /var/log/hadoop/userlogs/attempt_<job_id>/syslog (i.e. under ${hadoop.log.dir}/userlogs), which contains the log4j output.
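If the default INFO output isn't verbose enough while debugging, MRv2 also lets you raise the per-task log4j level through job configuration via the mapreduce.map.log.level and mapreduce.reduce.log.level properties. A minimal driver sketch, assuming Hadoop 2.x property names; the class name and job name are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LogLevelDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // MRv2 property names; raise the log4j level for map and reduce tasks
        conf.set("mapreduce.map.log.level", "DEBUG");
        conf.set("mapreduce.reduce.log.level", "DEBUG");

        Job job = Job.getInstance(conf, "sample-job");
        job.setJarByClass(LogLevelDriver.class);
        // set mapper/reducer classes and input/output paths as usual, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}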

Ashrith answered Oct 16 '22