
Why does submitting a job to MapReduce take so much time in general?

On our 20-node cluster, submitting a job to process 3 GB of data (200 splits) usually takes about 30 seconds, and the actual execution about 1 minute. I want to understand what the bottleneck in the job submission process is, and to understand the following quote:

Per-MapReduce overhead is significant: Starting/ending MapReduce job costs time

Some of the processes I'm aware of: 1. data splitting, 2. jar file sharing.

asked Jul 06 '12 by yura


3 Answers

A few things about HDFS and M/R help in understanding this latency:

  1. HDFS stores your files as data chunks distributed over multiple machines called datanodes.
  2. M/R runs multiple programs, called mappers, on each of the data chunks or blocks. The (key, value) output of these mappers is combined into a result by reducers (think of summing up partial results from multiple mappers).
  3. Each mapper and reducer is a full-fledged program that is spawned on this distributed system. It takes time to spawn a full-fledged program, even if, say, it does nothing (a no-op MapReduce program; see the sketch after this list).
  4. When the size of the data to be processed becomes very big, these spawn times become insignificant, and that is when Hadoop shines.
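To make the no-op case concrete, here is a minimal sketch of such a job against the Hadoop 2.x mapreduce API (the class name and the input/output paths are placeholders of my own, not from the original answer). Even though the stock Mapper and Reducer simply pass records through, every task still has to be scheduled and a JVM spawned for it, which is exactly the overhead being discussed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A "do nothing" job: the stock Mapper and Reducer are identity functions,
// so almost all of the measured time is scheduling and task-spawning overhead.
public class NoOpJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "no-op benchmark");
    job.setJarByClass(NoOpJob.class);
    job.setMapperClass(Mapper.class);    // identity map: emits (key, value) unchanged
    job.setReducerClass(Reducer.class);  // identity reduce: emits values unchanged
    FileInputFormat.addInputPath(job, new Path(args[0]));    // placeholder input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // placeholder output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}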

If you were to process a file with 1000 lines of content, you are better off using a normal file-read-and-process program. The Hadoop infrastructure needed to spawn processes on a distributed system will not yield any benefit; it will only contribute the additional overhead of locating the datanodes that hold the relevant chunks, starting the processing programs on them, and tracking and collecting the results.
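For comparison, the 1000-line case needs nothing more than a plain single-process read, roughly like this trivial sketch (my own illustration, no Hadoop involved; the file path comes from the command line):

import java.nio.file.Files;
import java.nio.file.Paths;

// The whole "job" runs in one local JVM: no datanode lookups, no task spawning.
public class LocalLineCount {
  public static void main(String[] args) throws Exception {
    long lines = Files.lines(Paths.get(args[0])).count();
    System.out.println("lines = " + lines);
  }
}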

Now expand that to hundreds of petabytes of data, and these overheads look completely insignificant compared to the time it would take to process it. The parallelization of the processing (mappers and reducers) shows its advantage here.

So before analyzing the performance of your M/R job, you should first benchmark your cluster so that you understand the overheads better.

How much time does it take to run a no-operation map-reduce program on a cluster?

Use MRBench for this purpose:

  1. MRBench loops a small job a number of times.
  2. It checks whether small job runs are responsive and run efficiently on your cluster.
  3. Its impact on the HDFS layer is very limited.

To run this program, try the following (check the correct invocation for recent Hadoop versions; the jar path below is from a Hadoop 0.20 installation):

hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50

Surprisingly, on one of our dev clusters it was 22 seconds.

Another issue is file size.

If file sizes are smaller than the HDFS block size, then Map/Reduce programs have significant overhead. Hadoop will typically try to spawn a mapper per block. That means that if you have 30 files of 5 KB each, Hadoop may end up spawning 30 mappers, one per file, even though each file is tiny. This is real wastage, since the per-program overhead is significant compared to the time spent processing such a small file.
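One common mitigation (my addition, not part of the original answer) is to pack many small files into fewer splits with CombineTextInputFormat, so that one mapper handles a batch of small files. A driver-configuration sketch follows; the class name is hypothetical and the 128 MB cap is purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// Driver fragment: combine many small files into larger splits so that one
// mapper processes a batch of files instead of a single tiny file each.
public class SmallFilesDriver {
  public static Job configure() throws Exception {
    Job job = Job.getInstance(new Configuration(), "small-files job");
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // illustrative ~128 MB cap
    return job;
  }
}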

pyfunc answered Sep 20 '22


As far as I know, there is no single bottleneck that causes the job-run latency; if there were, it would have been solved a long time ago.

There are a number of steps that take time, and there are reasons why the process is slow. I will try to list them and give estimates where I can:

  1. Running the hadoop client. It runs Java, and an overhead of about 1 second can be assumed.
  2. Putting the job into the queue and letting the current scheduler run it. I am not sure what the overhead is, but because of the asynchronous nature of the process, some latency should exist.
  3. Calculating splits.
  4. Running and synchronizing tasks. Here we face the fact that TaskTrackers poll the JobTracker, not the other way around. I think this is done for the sake of scalability. It means that when the JobTracker wants to execute some task, it does not call the task tracker, but waits for the appropriate tracker to ping it and pick up the job. Task trackers cannot ping the JobTracker too frequently; otherwise, in large clusters, they would kill it.
  5. Running tasks. Without JVM reuse it takes about 3 seconds; with reuse, the overhead is about 1 second per task (see the configuration sketch after this list).
  6. The client polls the JobTracker for the results (at least I think so), and this also adds some latency to learning that the job has finished.
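Regarding point 5, on classic (MR1) Hadoop the JVM-reuse behaviour is controlled by a single job property. A minimal sketch of how it is typically enabled is below (the class name is hypothetical; note this applies to the old mapred framework and is not honoured by YARN/MR2):

import org.apache.hadoop.mapred.JobConf;

// Classic (MR1) JVM reuse: -1 lets one spawned JVM run an unlimited number of
// this job's tasks, avoiding a fresh JVM start-up for every single task.
public class JvmReuseConfig {
  public static JobConf withJvmReuse() {
    JobConf conf = new JobConf();
    conf.setNumTasksToExecutePerJvm(-1);
    // equivalent property form: conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
    return conf;
  }
}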
David Gruzman answered Sep 19 '22


I have seen a similar issue, and I can break the solution down into the following steps:

  1. When HDFS stores too many small files with a fixed chunk size, there will be efficiency problems in HDFS. The best approach is to remove all unnecessary files and small data files, then try again.
  2. Try the following with the data nodes and name nodes:

    • Stop all services using stop-all.sh.
    • Format the name node.
    • Reboot the machine.
    • Start all services using start-all.sh.
    • Check the data and name nodes.
  3. Try installing a lower version of Hadoop (Hadoop 2.5.2); this worked in two cases for me, though it was found by trial and error.

JayPadhya answered Sep 22 '22