How can I find the time spent by each mapper and reducer, as well as the time spent shuffling (sorting), from within the code (not the web interface) in Hadoop? And how about the total time spent by all mappers (or all reducers)?
There is an API for the JobTracker, as described here, which gives you a bunch of information on the cluster itself as well as details for all jobs. In particular, if you know the job id and you want metrics for each individual map and reduce task, you can call getMapTaskReports (or getReduceTaskReports for the reducers), which returns TaskReport instances, detailed here, giving you access to methods such as getFinishTime and getStartTime. So, for example:
// jobClient: an org.apache.hadoop.mapred.JobClient instance connected to the cluster's JobTracker
JobID jobId = JobID.forName("your_job_id");
TaskReport[] maps = jobClient.getMapTaskReports(jobId);
for (TaskReport rpt : maps) {
    // Task duration in milliseconds
    long duration = rpt.getFinishTime() - rpt.getStartTime();
    System.out.println("Mapper duration: " + duration);
}
TaskReport[] reduces = jobClient.getReduceTaskReports(jobId);
for (TaskReport rpt : reduces) {
    long duration = rpt.getFinishTime() - rpt.getStartTime();
    System.out.println("Reducer duration: " + duration);
}
To get the total time spent by all mappers or by all reducers in your job, simply sum up those per-task durations in your code.
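For instance, a minimal sketch of that summation, reusing the maps and reduces arrays from the snippet above (the variable names are just for illustration):

long totalMapTime = 0;
for (TaskReport rpt : maps) {
    totalMapTime += rpt.getFinishTime() - rpt.getStartTime();    // per-map duration in ms
}
long totalReduceTime = 0;
for (TaskReport rpt : reduces) {
    totalReduceTime += rpt.getFinishTime() - rpt.getStartTime(); // per-reduce duration in ms
}
System.out.println("Total time in mappers (ms): " + totalMapTime);
System.out.println("Total time in reducers (ms): " + totalReduceTime);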
Regarding the shuffling: the JobTracker generally counts the shuffle as the first 33% of each reduce task's progress, which does not necessarily mean it takes 33% of the time. I don't think there's an automated way to get the shuffle time per task, so you could just go with that simple 33% heuristic.
Please take into account, though, that time measurements taken from the JobTracker API as shown above may bias the reducer times a bit: when a reduce task starts, it first does the shuffling (reported as up to 33% of progress, as explained), then it waits until all map tasks have finished, and only then does it run the actual reduce. So a reduce-task measurement is really the sum of these three periods (shuffle + wait + reduce).
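If the 33% approximation is good enough for you, a rough sketch of applying it per task would look like the following (this is purely the heuristic applied to the measured duration, not an actual shuffle measurement, and as noted above that duration also includes any time the reducer spent waiting for mappers):

for (TaskReport rpt : reduces) {
    long duration = rpt.getFinishTime() - rpt.getStartTime();
    long shuffleEstimate = (long) (duration * 0.33); // heuristic: first ~33% of the reduce task
    System.out.println("Estimated shuffle time (ms): " + shuffleEstimate);
}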