
Spark tasks don't seem to be well distributed

I am running a Spark job, and it seems that the tasks are not well distributed (see attached). Is there a way to make the tasks more evenly distributed? Thanks!

[Spark UI screenshot showing per-executor task counts and running times]

asked Jun 17 '15 by Edamame

People also ask

How tasks are distributed in Spark?

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task.

How many tasks does Spark run on each partition?

Spark assigns one task per partition and each worker can process one task at a time.
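As an illustration (not part of the page above), a minimal sketch of the one-task-per-partition rule, assuming a spark-shell session where a SparkSession named `spark` already exists:

    // One task per partition: the partition count decides how many tasks a stage runs.
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)
    println(rdd.getNumPartitions)                  // 8  -> a stage over this RDD runs 8 tasks
    println(rdd.repartition(16).getNumPartitions)  // 16 -> 16 tasks after the shuffle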

Why is my Spark job so slow?

Sometimes, Spark runs slowly because there are too many concurrent tasks running. The capacity for high concurrency is a beneficial feature, as it provides Spark-native fine-grained sharing. This leads to maximum resource utilization while cutting down query latencies.

How Spark decides number of tasks?

The number of CPU cores available to an executor determines how many tasks can be executed in parallel for an application at any given time.
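As a hedged illustration (not from the page above, and the values are hypothetical), the ceiling on concurrent tasks comes from executor resources; with the settings below, at most 4 executors × 2 cores = 8 tasks run at the same time:

    import org.apache.spark.sql.SparkSession

    // Hypothetical resource settings: 4 executors with 2 cores each
    // => at most 8 tasks can run in parallel for this application.
    val spark = SparkSession.builder()
      .appName("parallelism-demo")
      .config("spark.executor.instances", "4")
      .config("spark.executor.cores", "2")
      .getOrCreate()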


1 Answer

Taking a closer look at the posted image, I can identify two main facts:

  • The number of tasks has been evenly distributed, with a maximum variation of 20 tasks.
  • The running time of each executor differs significantly, from 3.0 min (~80 tasks) to 17.0 min (~60 tasks).

This makes me wonder about the nature of your application. Are all the tasks equal, or do some of them need more time to complete than others? If the tasks are heterogeneous, your issue needs to be looked at more carefully. Imagine the following scenario:

  • Number of tasks: 20, where each one needs 10 seconds to finish except the last one:

    Task 01: 10 seconds
    Task 02: 10 seconds
    Task 03: 10 seconds
    Task ...
    Task 20: 120 seconds
    
  • Number of executors: 4 (each with a single core)

If we had to evenly distribute the tasks, each executor would have to process 5 tasks in total. Taking into account that one executor is assigned the 20th task, which needs 120 seconds to complete, the execution flow would be the following:

  • By second 40, each executor will have completed its first 4 tasks, assuming the 20th task is left for last.
  • By second 50, every executor but one will have finished all of its tasks. The remaining executor would still be computing the 20th task, which takes 120 seconds on its own and therefore finishes at 2:40.

At the end, the user interface would show a result similar to yours, with the number of tasks evenly distributed but not the actual computing time.

Executor 01 -> tasks completed: 5 -> time: 0:50 minutes
Executor 02 -> tasks completed: 5 -> time: 0:50 minutes
Executor 03 -> tasks completed: 5 -> time: 0:50 minutes
Executor 04 -> tasks completed: 5 -> time: 2:40 minutes
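A toy job along these lines (a sketch of my own, not from the original answer; it assumes an existing SparkSession named `spark`) reproduces the pattern: task counts look balanced in the UI, while one executor's time is dominated by the single slow task.

    // 20 single-element partitions; element 20 simulates the slow task.
    val work = spark.sparkContext.parallelize(1 to 20, numSlices = 20)

    work.foreach { i =>
      val seconds = if (i == 20) 120 else 10  // one task is 12x slower than the rest
      Thread.sleep(seconds * 1000L)
    }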

Although not the same, a similar thing might be happening in your situation.
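If heterogeneous tasks turn out to be the cause, one common mitigation (my addition, not part of the original answer) is to split the work into many more, smaller partitions than there are executor cores, so the scheduler can keep the fast executors busy while the expensive records get spread out. A sketch with hypothetical names:

    // `df` and the partition count are hypothetical; tune the count to your data.
    val rebalanced = df.repartition(200)     // many small partitions instead of a few large ones
    rebalanced.write.parquet("/tmp/output")  // hypothetical action that triggers the job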

answered Sep 19 '22 by Mikel Urkia