My Apache Spark cluster is running an application that is giving me lots of executor timeouts:
10:23:30,761 ERROR ~ Lost executor 5 on slave2.cluster: Executor heartbeat timed out after 177005 ms
10:23:30,806 ERROR ~ Lost executor 1 on slave4.cluster: Executor heartbeat timed out after 176991 ms
10:23:30,812 ERROR ~ Lost executor 4 on slave6.cluster: Executor heartbeat timed out after 176981 ms
10:23:30,816 ERROR ~ Lost executor 6 on slave3.cluster: Executor heartbeat timed out after 176984 ms
10:23:30,820 ERROR ~ Lost executor 0 on slave5.cluster: Executor heartbeat timed out after 177004 ms
10:23:30,835 ERROR ~ Lost executor 3 on slave7.cluster: Executor heartbeat timed out after 176982 ms
However, in my configuration I can confirm I successfully increased the executor heartbeat interval.
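The relevant spark-defaults.conf line looks like the following (value illustrative, mirroring the one I pass via spark-submit further down):

spark.executor.heartbeatInterval 10000000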
When I visit the logs of executors marked as EXITED (i.e., the driver removed them when it couldn't get a heartbeat), it appears that the executors killed themselves because they never received any tasks from the driver:
16/05/16 10:11:26 ERROR TransportChannelHandler: Connection to /10.0.0.4:35328 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
16/05/16 10:11:26 ERROR CoarseGrainedExecutorBackend: Cannot register with driver: spark://[email protected]:35328
How can I turn off heartbeats and/or prevent the executors from timing out?
spark.executor.heartbeatInterval is the interval at which an executor reports its heartbeats to the driver, while spark.network.timeout controls how long the driver waits for a response before marking the executor as lost and starting a new one. So if GC is taking a long time on an executor, raising spark.network.timeout gives the driver more time to get a response before declaring the executor dead.
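As a rough sketch, both settings can be raised together in spark-defaults.conf (the values below are illustrative, not recommendations; the only hard constraint is that the heartbeat interval stays well below the network timeout):

# spark-defaults.conf -- illustrative values
spark.executor.heartbeatInterval  60s
spark.network.timeout             600s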
An executor resides on a worker node. Executors are launched at the start of a Spark application in coordination with the cluster manager, and the driver adds and removes them dynamically as required. Their job is to run individual tasks and return the results to the driver.
Executors in Spark are the worker processes in charge of running the individual tasks of a given Spark job. They are launched at the beginning of the application, and as soon as a task finishes, its result is sent back to the driver.
Missing heartbeats and executors being killed by YARN are nearly always due to OOMs. You should inspect the logs on the individual executors (look for the text "running beyond physical memory"). If you have many executors and find it cumbersome to inspect all of the logs manually, I recommend monitoring your job in the Spark UI while it runs. As soon as a task fails, it will report the cause in the UI, so it's easy to see. Note that some tasks will report failure due to missing executors that have already been killed, so make sure you look at the causes for each of the individual failing tasks.
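If executor logs are aggregated by YARN, one way to scan them all at once is the yarn logs CLI (the application ID below is a placeholder):

# Search all container logs for the YARN physical-memory kill message
yarn logs -applicationId <application_id> | grep "running beyond physical memory"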
Note also that most OOM problems can be solved quickly by simply repartitioning your data at appropriate places in your code (again, look at the Spark UI for hints as to where a call to repartition might be needed); a sketch follows below. Otherwise, you might want to scale up your machines to accommodate the need for memory.
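Here is a minimal PySpark sketch of that repartitioning (the input/output paths, column name, and partition count are all placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()
df = spark.read.parquet("path/to/input")  # placeholder input

# Spread rows across more, smaller partitions before the shuffle-heavy stage;
# 400 is illustrative -- size partitions to fit comfortably in executor memory.
df = df.repartition(400)
result = df.groupBy("key").count()
result.write.parquet("path/to/output")  # placeholder output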
The answer was rather simple. In my spark-defaults.conf I set spark.network.timeout to a higher value. The heartbeat interval was somewhat irrelevant to the problem (though tuning it is handy).
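In spark-defaults.conf that is a single line (using the same value I pass via spark-submit below):

spark.network.timeout 10000000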
When using spark-submit, I was also able to set the timeout as follows (note that --conf values must use key=value syntax):
$SPARK_HOME/bin/spark-submit --conf spark.network.timeout=10000000 --class myclass.neuralnet.TrainNetSpark --master spark://master.cluster:7077 --driver-memory 30G --executor-memory 14G --num-executors 7 --executor-cores 8 --conf spark.driver.maxResultSize=4g --conf spark.executor.heartbeatInterval=10000000 path/to/my.jar
If you are using PySpark, changing the Spark context's configuration will solve this problem. You can set it as follows (note that all times mentioned are in ms); spark.executor.heartbeatInterval (default 10000) should be less than spark.network.timeout (default 120000):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Heartbeat interval (ms) must stay below the network timeout (ms)
conf = SparkConf().setAppName("application") \
    .set("spark.executor.heartbeatInterval", "200000") \
    .set("spark.network.timeout", "300000")
sc = SparkContext.getOrCreate(conf)
sqlcontext = SQLContext(sc)
Hope this solves your problem. If you face any further errors, visit the Spark configuration documentation.