
Are failed Spark executors a cause for concern?

Tags:

apache-spark

I understand that Apache Spark is designed around resilient data structures, but are failures expected while a system is running, or do they typically indicate a problem?

As I begin to scale the system out to different configurations, I see ExecutorLostFailure and No more replicas warnings (see below). The system recovers and the program finishes.

Should I be concerned about this, and are there things we can typically do to avoid it, or is this expected as the number of executors grows?

18/05/18 23:59:00 WARN TaskSetManager: Lost task 87.0 in stage 4044.0 (TID 391338, ip-10-0-0-68.eu-west-1.compute.internal, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container marked as failed: container_1526667532988_0010_01_000012 on host: ip-10-0-0-68.eu-west-1.compute.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_193_7 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_582_50 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_401_91 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_582_186 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_115_139 !
irbull asked May 19 '18 01:05

People also ask

What happens if a Spark executor fails?

If an executor runs into memory issues, it will fail the task, which is then retried where the last attempt left off. If that task fails after 3 retries (4 attempts in total by default), then that stage will fail and cause the Spark job as a whole to fail.

Which of the following will cause a Spark job to fail?

Spark jobs might fail due to out-of-memory exceptions at the driver or executor end. When troubleshooting out-of-memory exceptions, you should understand how much memory and how many cores the application requires, as these are the essential parameters for optimizing the Spark application.
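For reference, a minimal sketch in Scala of setting these parameters explicitly when building the session; the application name and the numeric values are placeholders for illustration, not tuned recommendations:

import org.apache.spark.sql.SparkSession

// Hypothetical sizing example: the values below are placeholders, not recommendations.
val spark = SparkSession.builder()
  .appName("executor-sizing-example")        // hypothetical application name
  .config("spark.executor.memory", "4g")     // heap memory available to each executor
  .config("spark.executor.cores", "2")       // concurrent tasks per executor
  .config("spark.executor.instances", "10")  // number of executors to request
  .getOrCreate()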

How can you recognize failure in your Spark job?

When a Spark job or application fails, you can use the Spark logs to analyze the failures. The QDS UI provides links to the logs in the Application UI and Spark Application UI. If you are running the Spark job or application from the Analyze page, you can access the logs via the Application UI and Spark Application UI.

What do executors do in Spark?

Executors in Spark are the worker processes that run on cluster nodes and are in charge of executing the individual tasks of a given Spark job. They are launched at the beginning of a Spark application, and as soon as a task completes, its results are sent back to the driver.


1 Answer

As I begin to scale the system out to different configurations, I see ExecutorLostFailure and No more replicas warnings (see below). Should I be concerned about this?

You are right: this exception does not necessarily mean that something is wrong with your Spark job, because it can be thrown even in cases where a server stopped working for physical reasons (e.g. a hardware outage).

However, if you see multiple executor failures in your job, this is probably a signal that something can be improved. More specifically, the Spark configuration contains a parameter called spark.task.maxFailures, which corresponds to the maximum number of failures allowed for each task, after which the job is considered failed. As a result, in a well-behaved Spark job you might see some executor failures, but they should be rare, and you should rarely see a specific task failing multiple times, because that usually means the failure is not the executor's fault but that the task itself is too heavy to process.
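As an illustration, a minimal sketch in Scala of raising this limit when building the session; the value 8 is just an example (Spark's default is 4), and the application name is a placeholder:

import org.apache.spark.sql.SparkSession

// Allow each task to fail up to 8 times before the job as a whole is marked as failed.
// Spark's default for spark.task.maxFailures is 4.
val spark = SparkSession.builder()
  .appName("task-failure-tolerance")         // hypothetical application name
  .config("spark.task.maxFailures", "8")
  .getOrCreate()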

Are there typically things we can do to avoid this?

That depends a lot on the nature of your job. However, as mentioned above, the usual suspect is that a task is too heavy for an executor (e.g. in terms of the memory it requires). Spark creates a number of partitions for each RDD based on several factors, such as the size of your cluster. If, for example, your cluster is quite small, Spark might create partitions that are very large and cause problems for the executors. So you can try re-partitioning the RDDs in your code to enforce more, smaller partitions, which can be processed more easily.
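A minimal sketch of the re-partitioning idea in Scala; the input path and the target partition count are made-up values for illustration:

// Assumes `spark` is an existing SparkSession; the path below is hypothetical.
val lines = spark.sparkContext.textFile("hdfs:///data/large-input")
println(s"Partitions before: ${lines.getNumPartitions}")

// repartition() shuffles the data into the requested number of partitions,
// so each task processes a smaller chunk and is less likely to overload an executor.
val repartitioned = lines.repartition(400)
println(s"Partitions after: ${repartitioned.getNumPartitions}")

Note that repartition() triggers a full shuffle; coalesce() avoids a shuffle but can only reduce the number of partitions, so it does not help when the goal is more, smaller partitions.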

Dimos answered Sep 30 '22 18:09