I am running a Spark job, and it seems that the tasks are not well distributed (see attached). Is there a way to make the tasks more evenly distributed? Thanks!
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task.
Spark assigns one task per partition, and each executor core can process one task at a time.
Sometimes Spark runs slowly because too many tasks are running concurrently. High concurrency is still generally a beneficial feature, as it provides Spark-native fine-grained sharing, which improves resource utilization and cuts down query latencies.
Number of tasks executed in parallel
The number of CPU cores available to an executor determines how many tasks can be executed in parallel for an application at any given time.
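As a rough illustration of the point above, the sketch below estimates how many tasks can run at once and how many "waves" a stage needs. The function names `parallel_slots` and `waves` are hypothetical helpers, and the figures (4 executors, 2 cores each, 20 tasks) are assumed values, not taken from the question. The arguments mirror the Spark settings `spark.executor.instances`, `spark.executor.cores` and `spark.task.cpus`.

```python
import math


def parallel_slots(num_executors: int, cores_per_executor: int, cpus_per_task: int = 1) -> int:
    """Upper bound on the number of tasks that can run at the same time."""
    # Each task occupies cpus_per_task cores on a single executor.
    return num_executors * (cores_per_executor // cpus_per_task)


def waves(num_tasks: int, slots: int) -> int:
    """Number of rounds needed if every task took the same time."""
    return math.ceil(num_tasks / slots)


# Assumed example values: 4 executors with 2 cores each, 20 tasks (partitions).
slots = parallel_slots(num_executors=4, cores_per_executor=2)
print(slots)             # 8
print(waves(20, slots))  # 3
```

If the number of partitions is much smaller than the number of slots, some cores stay idle no matter how the tasks are scheduled.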
Taking a closer look at the posted image, I can identify two main facts:
This makes me wonder about the nature of your application. Are all the tasks equal, or do some of them need more time to complete than others? If the tasks are heterogeneous, your issue needs to be looked at more carefully. Imagine the following scenario:
Number of tasks: 20, where each one needs 10 seconds to finish except the last one:
Task 01: 10 seconds
Task 02: 10 seconds
Task 03: 10 seconds
Task ...
Task 20: 120 seconds
If we distributed the tasks evenly, each executor would have to process 5 tasks in total. Since one executor is assigned the 20th task, which needs 120 seconds to complete, the execution flow would be the following: the three executors that only receive 10-second tasks finish after 50 seconds, while the executor holding the 20th task keeps running until 160 seconds (4 x 10 + 120).
At the end, the user interface would show a result similar to yours, with the number of tasks evenly distributed but not the actual computing time:
Executor 01 -> tasks completed: 5 -> time: 0:50 minutes
Executor 02 -> tasks completed: 5 -> time: 0:50 minutes
Executor 03 -> tasks completed: 5 -> time: 0:50 minutes
Executor 04 -> tasks completed: 5 -> time: 2:40 minutes
Although not the same, a similar thing might be happening in your situation.
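The scenario above can be reproduced with a few lines of plain Python. This is only a simplified model under stated assumptions: four executors, one task at a time per executor, and tasks handed out in equal contiguous chunks. It does not model Spark's actual scheduler, and `static_even_assignment` is a hypothetical helper name.

```python
def static_even_assignment(durations, num_executors):
    """Split tasks into equal contiguous chunks, one chunk per executor.

    Returns a list of (task_count, busy_seconds) tuples, one per executor.
    """
    if len(durations) % num_executors != 0:
        raise ValueError("this sketch assumes the tasks divide evenly")
    per_executor = len(durations) // num_executors
    result = []
    for i in range(num_executors):
        chunk = durations[i * per_executor:(i + 1) * per_executor]
        result.append((len(chunk), sum(chunk)))
    return result


# 19 tasks of 10 seconds each plus one task of 120 seconds, as in the scenario.
durations = [10] * 19 + [120]
summary = static_even_assignment(durations, num_executors=4)

for idx, (count, seconds) in enumerate(summary, start=1):
    minutes, secs = divmod(seconds, 60)
    print(f"Executor {idx:02d} -> tasks completed: {count} -> time: {minutes}:{secs:02d}")

# The stage only finishes when the slowest executor is done.
stage_seconds = max(seconds for _, seconds in summary)
print(stage_seconds)  # 160
```

The task counts come out equal (5 each), but the stage still takes 160 seconds because a single long task dominates one executor's time.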