
Are failed Spark executors a cause for concern?

Tags:

apache-spark

I understand that Apache Spark is designed around resilient data structures, but are failures expected while a system is running, or do they typically indicate a problem?

As I begin to scale the system out to different configurations, I see ExecutorLostFailure and No more replicas warnings (see below). The system recovers and the program finishes.

Should I be concerned about this, and are there things we can typically do to avoid it, or is this expected as the number of executors grows?

18/05/18 23:59:00 WARN TaskSetManager: Lost task 87.0 in stage 4044.0 (TID 391338, ip-10-0-0-68.eu-west-1.compute.internal, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container marked as failed: container_1526667532988_0010_01_000012 on host: ip-10-0-0-68.eu-west-1.compute.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_193_7 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_582_50 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_401_91 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_582_186 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_115_139 !
irbull asked May 19 '18 01:05

People also ask

What happens if a Spark executor fails?

If an executor runs into memory issues, it will fail the task, which is then retried where the last attempt left off. If that task fails after 3 retries (4 attempts in total by default), then that stage will fail and cause the Spark job as a whole to fail.

Which of the following will cause a Spark job to fail?

Spark jobs might fail due to out-of-memory exceptions at the driver or executor end. When troubleshooting out-of-memory exceptions, you should understand how much memory and how many cores the application requires, as these are the essential parameters for optimizing the Spark application.
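For reference, a minimal sketch in Scala of setting these parameters explicitly when building the session; the application name and the numeric values are placeholders for illustration, not tuned recommendations:

import org.apache.spark.sql.SparkSession

// Hypothetical sizing example: the values below are placeholders, not recommendations.
val spark = SparkSession.builder()
  .appName("executor-sizing-example")        // hypothetical application name
  .config("spark.executor.memory", "4g")     // heap memory available to each executor
  .config("spark.executor.cores", "2")       // concurrent tasks per executor
  .config("spark.executor.instances", "10")  // number of executors to request
  .getOrCreate()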

How can you recognize failure in your Spark job?

When a Spark job or application fails, you can use the Spark logs to analyze the failures. The QDS UI provides links to the logs in the Application UI and Spark Application UI. If you are running the Spark job or application from the Analyze page, you can access the logs via the Application UI and Spark Application UI.

What do executors do in Spark?

Executors in Spark are the worker processes that run on cluster nodes and are in charge of executing the individual tasks of a given Spark job. They are launched at the beginning of a Spark application, and as soon as a task completes, its results are sent back to the driver.


1 Answer

As I begin to scale the system out to different configurations, I see ExecutorLostFailure and No more replicas warnings (see below). Should I be concerned about this?

You are right: this exception does not necessarily mean that something is wrong with your Spark job, because it can be thrown even in cases where a server stopped working for physical reasons (e.g. a hardware outage).

However, if you see multiple executor failures in your job, this is probably a signal that something can be improved. More specifically, the Spark configuration contains a parameter called spark.task.maxFailures, which corresponds to the maximum number of failures allowed for each task, after which the job is considered failed. As a result, in a well-behaved Spark job you might see some executor failures, but they should be rare, and you should rarely see a specific task failing multiple times, because that usually means the failure is not the executor's fault but that the task itself is too heavy to process.
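As an illustration, a minimal sketch in Scala of raising this limit when building the session; the value 8 is just an example (Spark's default is 4), and the application name is a placeholder:

import org.apache.spark.sql.SparkSession

// Allow each task to fail up to 8 times before the job as a whole is marked as failed.
// Spark's default for spark.task.maxFailures is 4.
val spark = SparkSession.builder()
  .appName("task-failure-tolerance")         // hypothetical application name
  .config("spark.task.maxFailures", "8")
  .getOrCreate()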

Are there typically things we can do to avoid this?

That depends a lot on the nature of your job. However, as mentioned above, the usual suspect is that a task is too heavy for an executor (e.g. in terms of the memory it requires). Spark creates a number of partitions for each RDD based on several factors, such as the size of your cluster. If, for example, your cluster is quite small, Spark might create partitions that are very large and cause problems for the executors. So you can try re-partitioning the RDDs in your code to enforce more, smaller partitions, which can be processed more easily.
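A minimal sketch of the re-partitioning idea in Scala; the input path and the target partition count are made-up values for illustration:

// Assumes `spark` is an existing SparkSession; the path below is hypothetical.
val lines = spark.sparkContext.textFile("hdfs:///data/large-input")
println(s"Partitions before: ${lines.getNumPartitions}")

// repartition() shuffles the data into the requested number of partitions,
// so each task processes a smaller chunk and is less likely to overload an executor.
val repartitioned = lines.repartition(400)
println(s"Partitions after: ${repartitioned.getNumPartitions}")

Note that repartition() triggers a full shuffle; coalesce() avoids a shuffle but can only reduce the number of partitions, so it does not help when the goal is more, smaller partitions.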

Dimos answered Sep 30 '22 18:09