 

Spark application kills executor

Tags:

apache-spark

I'm running a Spark cluster in standalone mode, with the application submitted via spark-submit. In the stage section of the Spark UI I found a stage with an unusually long execution time (> 10 h, when the usual time is ~30 s). The stage has many failed tasks with the error Resubmitted (resubmitted due to lost executor). In the Aggregated Metrics by Executor section of the stage page there is an executor with the address CANNOT FIND ADDRESS. Spark tries to resubmit these tasks indefinitely. If I kill this stage (my application automatically reruns incomplete Spark jobs), everything continues to work fine.

I also found some strange entries in the Spark logs, from the same time the stage execution started.

Master:

16/11/19 19:04:32 INFO Master: Application app-20161109161724-0045 requests to kill executors: 0
16/11/19 19:04:36 INFO Master: Launching executor app-20161109161724-0045/1 on worker worker-20161108150133
16/11/19 19:05:03 WARN Master: Got status update for unknown executor app-20161109161724-0045/0
16/11/25 10:05:46 INFO Master: Application app-20161109161724-0045 requests to kill executors: 1
16/11/25 10:05:48 INFO Master: Launching executor app-20161109161724-0045/2 on worker worker-20161108150133
16/11/25 10:06:14 WARN Master: Got status update for unknown executor app-20161109161724-0045/1

Worker:

16/11/25 10:06:05 INFO Worker: Asked to kill executor app-20161109161724-0045/1
16/11/25 10:06:08 INFO ExecutorRunner: Runner thread for executor app-20161109161724-0045/1 interrupted
16/11/25 10:06:08 INFO ExecutorRunner: Killing process!
16/11/25 10:06:13 INFO Worker: Executor app-20161109161724-0045/1 finished with state KILLED exitStatus 137
16/11/25 10:06:14 INFO Worker: Asked to launch executor app-20161109161724-0045/2 for app.jar
16/11/25 10:06:17 INFO SecurityManager: Changing view acls to: spark
16/11/25 10:06:17 INFO SecurityManager: Changing modify acls to: spark
16/11/25 10:06:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)

Network connectivity is not the problem, because the worker, the master (logs above), and the driver all run on the same machine.

Spark version: 1.6.1

asked Dec 01 '16 by Cortwave


1 Answer

The interesting part of the log is likely this line:

16/11/25 10:06:13 INFO Worker: Executor app-20161109161724-0045/1 finished with state KILLED exitStatus 137

Exit 137 strongly suggests a resource issue, either memory or CPU cores. Given that you can fix your issue by rerunning the stage, it could be that all cores were somehow already allocated (maybe you also had a Spark shell running?). This is a common issue with standalone Spark setups where everything runs on one host.
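As a quick sanity check (my addition, not part of the original answer): exit status 137 is how the shell reports a process terminated by SIGKILL (128 + signal 9), which is the signal the kernel OOM killer delivers:

```shell
# 137 = 128 + 9 (SIGKILL); a subshell that kills itself with
# SIGKILL reports the same exit status the OOM killer produces.
sh -c 'kill -9 $$'
echo $?   # prints 137
```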

Either way, I would try the following things, in order:

  1. Raise the storage memory fraction spark.storage.memoryFraction to pre-allocate more memory for storage and prevent the system OOM killer from randomly giving you that 137 on a big stage.
  2. Set a lower number of cores for your application to rule out something pre-allocating those cores before your stage runs. You can do this via spark.deploy.defaultCores; set it to 3 or even 2 (on an Intel quad-core, assuming 8 vcores).
  3. Outright allocate more RAM to Spark: spark.executor.memory needs to go up.
  4. Maybe you are running into an issue with metadata cleanup, which is also not unheard of in local deployments. In that case, adding
    export SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 -Dspark.cleaner.ttl=43200" to the end of your spark-env.sh might do the trick by forcing the metadata cleanup to run more frequently.
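To illustrate, the first three suggestions can all be passed to spark-submit as --conf flags without touching config files. This is a sketch: the concrete memory and core values are placeholders to tune for your host, and the master URL is an assumption.

```shell
# Illustrative spark-submit invocation combining suggestions 1-3.
# Values are placeholders, not tuned recommendations; the master
# URL assumes a local standalone master on the default port.
spark-submit \
  --master spark://localhost:7077 \
  --conf spark.storage.memoryFraction=0.4 \
  --conf spark.deploy.defaultCores=2 \
  --conf spark.executor.memory=4g \
  app.jar
```

Flags set on the command line override spark-defaults.conf, so this is a convenient way to experiment before committing a value to the cluster config.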

One of these should do the trick in my opinion.

answered Oct 29 '22 by Armin Braun