
Spark tasks don't seem to be well distributed

I am running a Spark job, and it seems that the tasks are not well distributed (see attached). Is there a way to make the tasks more evenly distributed? Thanks!

[Spark UI screenshot showing per-executor task counts and running times]

asked Jun 17 '15 by Edamame

People also ask

How tasks are distributed in Spark?

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task.

How many tasks does Spark run on each partition?

Spark assigns one task per partition and each worker can process one task at a time.
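As an illustration (not part of the page above), a minimal sketch of the one-task-per-partition rule, assuming a spark-shell session where a SparkSession named `spark` already exists:

    // One task per partition: the partition count decides how many tasks a stage runs.
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)
    println(rdd.getNumPartitions)                  // 8  -> a stage over this RDD runs 8 tasks
    println(rdd.repartition(16).getNumPartitions)  // 16 -> 16 tasks after the shuffle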

Why is my Spark job so slow?

Sometimes, Spark runs slowly because there are too many concurrent tasks running. The capacity for high concurrency is a beneficial feature, as it provides Spark-native fine-grained sharing. This leads to maximum resource utilization while cutting down query latencies.

How Spark decides number of tasks?

The number of CPU cores available to an executor determines how many tasks can be executed in parallel for an application at any given time.
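As a hedged illustration (not from the page above, and the values are hypothetical), the ceiling on concurrent tasks comes from executor resources; with the settings below, at most 4 executors × 2 cores = 8 tasks run at the same time:

    import org.apache.spark.sql.SparkSession

    // Hypothetical resource settings: 4 executors with 2 cores each
    // => at most 8 tasks can run in parallel for this application.
    val spark = SparkSession.builder()
      .appName("parallelism-demo")
      .config("spark.executor.instances", "4")
      .config("spark.executor.cores", "2")
      .getOrCreate()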


1 Answer

Taking a closer look at the posted image, I can identify two main facts:

  • The number of tasks has been evenly distributed, with a maximum variation of 20 tasks.
  • The running time of each executor differs significantly, from 3.0 min (~80 tasks) to 17.0 min (~60 tasks).

This makes me wonder about the nature of your application. Are all the tasks equal, or do some of them need more time to complete than others? If the tasks are heterogeneous, your issue needs to be looked at more carefully. Imagine the following scenario:

  • Number of tasks: 20, where each one needs 10 seconds to finish except the last one:

    Task 01: 10 seconds
    Task 02: 10 seconds
    Task 03: 10 seconds
    Task ...
    Task 20: 120 seconds
    
  • Number of executors: 4 (each with a single core)

If we had to evenly distribute the tasks, each executor would have to process 5 tasks in total. Taking into account that one executor is assigned the 20th task, which needs 120 seconds to complete, the execution flow would be the following:

  • By second 40, each executor will have completed its first 4 tasks, assuming the 20th task is left for last.
  • By second 50, every executor but one will have finished all of its tasks. The remaining executor would still be computing the 20th task, which takes 120 seconds on its own and therefore finishes at 2:40.

At the end, the user interface would show a result similar to yours, with the number of tasks evenly distributed but not the actual computing time.

Executor 01 -> tasks completed: 5 -> time: 0:50 minutes
Executor 02 -> tasks completed: 5 -> time: 0:50 minutes
Executor 03 -> tasks completed: 5 -> time: 0:50 minutes
Executor 04 -> tasks completed: 5 -> time: 2:40 minutes
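A toy job along these lines (a sketch of my own, not from the original answer; it assumes an existing SparkSession named `spark`) reproduces the pattern: task counts look balanced in the UI, while one executor's time is dominated by the single slow task.

    // 20 single-element partitions; element 20 simulates the slow task.
    val work = spark.sparkContext.parallelize(1 to 20, numSlices = 20)

    work.foreach { i =>
      val seconds = if (i == 20) 120 else 10  // one task is 12x slower than the rest
      Thread.sleep(seconds * 1000L)
    }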

Although not the same, a similar thing might be happening in your situation.
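If heterogeneous tasks turn out to be the cause, one common mitigation (my addition, not part of the original answer) is to split the work into many more, smaller partitions than there are executor cores, so the scheduler can keep the fast executors busy while the expensive records get spread out. A sketch with hypothetical names:

    // `df` and the partition count are hypothetical; tune the count to your data.
    val rebalanced = df.repartition(200)     // many small partitions instead of a few large ones
    rebalanced.write.parquet("/tmp/output")  // hypothetical action that triggers the job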

answered Sep 19 '22 by Mikel Urkia