I have written a Spark job that seems to work fine for almost an hour, after which executors start getting lost because of timeouts. I see the following in the log:
15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no recent heartbeats: 1051638 ms exceeds timeout 1000000 ms
I don't see any errors, only the warning above. Because of it the executor gets removed by YARN, and I then see "Rpc client disassociated" errors, "IOException: Connection refused", and FetchFailedException.
After an executor gets removed, I see it being added again and starting to work, and then some other executor fails the same way. My questions: is it normal for executors to get lost, and what happens to the tasks the lost executors were working on? My Spark job keeps running; it is long, around 4-5 hours, and I have a very good cluster with 1.2 TB of memory and a good number of CPU cores.
To solve the timeout issue above, I tried increasing spark.akka.timeout to 1000 seconds, but no luck. I am using the following command to run my Spark job. I am new to Spark, and I am using Spark 1.4.1.
./spark-submit --class com.xyz.abc.MySparkJob --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" --driver-java-options -XX:MaxPermSize=512m --driver-memory 4g --master yarn-client --executor-memory 25G --executor-cores 8 --num-executors 5 --jars /path/to/spark-job.jar
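In code, the equivalent timeout settings can also be applied on the SparkConf; a minimal sketch, assuming Spark 1.4-style time-suffixed values (the values here are illustrative, not a recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: raise the RPC/heartbeat-related timeouts instead of only spark.akka.timeout.
    val conf = new SparkConf()
      .setAppName("MySparkJob")
      .set("spark.network.timeout", "1000s")          // umbrella network timeout (Spark 1.3+)
      .set("spark.executor.heartbeatInterval", "60s") // how often executors heartbeat the driver
    val sc = new SparkContext(conf)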
You can resolve it by tuning the shuffle partitioning: increase the value of spark.sql.shuffle.partitions so that each task processes less data.
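A minimal sketch of how that setting can be applied on a Spark 1.x SQLContext (400 is an illustrative value, not a recommendation for this cluster):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // `sc` is an existing SparkContext
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")  // default is 200
    // For plain RDD shuffles, the analogous knob is spark.default.parallelism.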
Keep the memory model in mind as well: if you request, say, 8 GB of executor memory, the Spark executor JVM gets 8 GB, and YARN adds roughly 10% (about 800 MB) as memory overhead, so the total physical memory of the container is about 8.8 GB. You therefore have three limits to respect: the JVM heap, the off-heap overhead, and the total container size YARN will allocate.
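The arithmetic, as a sketch (the max(384 MB, 10%) rule matches the spark.yarn.executor.memoryOverhead default in this Spark generation; verify against your version's docs):

    val executorMemoryMb = 8 * 1024                                  // requested executor heap
    val overheadMb = math.max(384, (executorMemoryMb * 0.10).toInt)  // default YARN overhead
    val containerMb = executorMemoryMb + overheadMb                  // what YARN actually allocates
    println(s"container = $containerMb MB")                          // ~8.8 GB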
As a best practice, reserve the following cluster resources when estimating the Spark application settings:
- 1 core per node
- 1 GB RAM per node
- 1 executor per cluster for the application manager
The default for spark.executor.cores is 1 in YARN mode, and all the available cores on the worker in standalone mode.
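Putting those reservations together, a hedged sizing sketch (the 16-core/64 GB node size and the 5-cores-per-executor rule of thumb are assumptions, not facts about this cluster):

    val coresPerNode = 16 - 1  // reserve 1 core per node
    val memPerNodeGb = 64 - 1  // reserve 1 GB RAM per node
    val coresPerExecutor = 5   // common rule of thumb for good HDFS throughput
    val executorsPerNode = coresPerNode / coresPerExecutor  // 3 executors per node
    val memPerExecutorGb = memPerNodeGb / executorsPerNode  // 21 GB each
    // Remember to carve spark.yarn.executor.memoryOverhead out of that 21 GB.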
What might be happening is that the slaves can no longer launch executors, due to a memory issue. Look for the following messages in the master logs:
15/07/13 13:46:50 INFO Master: Removing executor app-20150713133347-0000/5 because it is EXITED
15/07/13 13:46:50 INFO Master: Launching executor app-20150713133347-0000/9 on worker worker-20150713153302-192.168.122.229-59013
15/07/13 13:46:50 DEBUG Master: [actor] handled message (2.247517 ms) ExecutorStateChanged(app-20150713133347-0000,5,EXITED,Some(Command exited with code 1),Some(1)) from Actor[akka.tcp://[email protected]:59013/user/Worker#-83763597]
You might find detailed Java errors in the worker's log directory, and possibly a JVM crash file of this type: work/app-id/executor-id/hs_err_pid11865.log.
See http://pastebin.com/B4FbXvHR
This issue might be resolved by how your application manages its RDDs (caching them and releasing them when they are no longer needed), not by increasing the size of the JVM's heap.
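A minimal sketch of what that management can look like, releasing cached RDDs once downstream results are materialized (the paths and the transformation are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-lifecycle-sketch"))
    val cached = sc.textFile("/path/to/input")  // placeholder input path
      .map(line => (line.take(1), 1))           // placeholder transformation
      .cache()                                  // pinned in executor memory
    val counts = cached.reduceByKey(_ + _)
    counts.saveAsTextFile("/path/to/out")       // action materializes `counts`
    cached.unpersist()                          // then free the cached blocks on executors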