 

What does CPU Time for a Hadoop Job signify?

I am afraid I do not understand the timing results of a Map-Reduce job. For example, a job I am running gives me the following results from the job tracker.

Finished in: 1mins, 39sec

CPU time spent (ms): 150,460 (Map), 152,030 (Reduce), 302,490 (Total)

How is "CPU time spent" actually measured, and what does it signify? Is it the total cumulative CPU time spent across all of the mappers and reducers assigned to the job? Is it possible to measure other times from the framework, such as the time for shuffle, sort, partition, etc.? If so, how?

A second question bothers me. I have seen some posts here (Link1, Link2) which suggest using getTime() in the driver class:

long start = new Date().getTime();
boolean status = job.waitForCompletion(true);
long end = new Date().getTime();
System.out.println("Job took " + (end - start) + " milliseconds");

Isn't this just what the "Finished in" entry in the JobTracker output already provides? Is it necessary? What is the best way to time a Hadoop job, especially when I want to measure I/O time and compute time per node and per stage?

Alps asked Jul 30 '12
1 Answer

The map phase consists of: record reader, map, combiner, and partitioner.

The reduce phase consists of: shuffle, sort, reduce, output.

The CPU time you are seeing there covers the entire map phase and the entire reduce phase, not just the map and reduce functions themselves. The terminology is confusing because the map and reduce functions are only a portion of the map phase and reduce phase. It is also the total CPU time summed across all of the nodes in the cluster.

CPU time is hugely different from real time. CPU time is how much time something spent on the CPUs, while real time is what you and I experience as humans. Think about it this way: assume you have the same job running over the same data, first on a 20-node cluster and then on a 200-node cluster. Overall, about the same amount of CPU time will be used on both clusters, but the 200-node cluster will run roughly 10x faster in real time. CPU time is a useful metric when you have a shared system with lots of jobs running on it at the same time.
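To make that concrete with the numbers from your question: 302,490 ms of total CPU time over a wall-clock run of 1 min 39 sec (99,000 ms) works out to 302,490 / 99,000 ≈ 3, i.e. on average about three cores were busy at any given moment across the whole cluster.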

I don't know of a way to dive deeper and get the CPU time for each individual phase. Using a date timer is probably not what you are looking for, though.
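If you want the aggregate CPU time programmatically rather than from the web UI, one option is to read the job's built-in counters after it finishes. A minimal sketch, assuming the Hadoop 2.x mapreduce API (counter names and availability vary between versions):

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.TaskCounter;

// after job.waitForCompletion(true) in the driver:
Counters counters = job.getCounters();

// total CPU milliseconds charged to all map and reduce tasks,
// summed across the cluster -- the same aggregate the web UI reports
long cpuMs = counters.findCounter(TaskCounter.CPU_MILLISECONDS).getValue();

// garbage collection time is tracked and aggregated the same way
long gcMs = counters.findCounter(TaskCounter.GC_TIME_MILLIS).getValue();

System.out.println("Aggregate CPU time (ms): " + cpuMs);
System.out.println("Aggregate GC time (ms): " + gcMs);

As far as I know, shuffle and sort timing is not exposed as a counter; the closest thing is the per-attempt shuffle-finished and sort-finished timestamps on the job history pages.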

Donald Miner answered Sep 27 '22