
Spark tasks stuck at RUNNING

Tags:

apache-spark

I'm trying to run a Spark ML pipeline (load some data from JDBC, run some transformers, train a model) on my YARN cluster, but each time I run it, a few of my executors (sometimes one, sometimes 3 or 4) get stuck running their first set of tasks (3 tasks, one per core, since each executor has 3 cores), while the rest run normally, completing 3 at a time.

In the UI, you'd see something like this: [Spark web UI screenshot]

Some things I have observed so far:

  • When I set my executors to use 1 core each with spark.executor.cores (i.e. run 1 task at a time), the issue does not occur (see the config sketch after this list);
  • The stuck executors always seem to be the ones that had to get some partitions shuffled to them in order to run the task;
  • The stuck tasks would ultimately be completed successfully by a speculative copy running on another instance;
  • Occasionally, a single task would get stuck on an executor that is otherwise normal, while its other 2 cores kept working fine;
  • The stuck executor instances look like everything is normal: CPU is at ~100%, there is plenty of memory to spare, the JVM processes are alive, neither Spark nor YARN logs anything out of the ordinary, and they can still receive instructions from the driver, such as "drop this task, another instance has already speculatively executed it" (though, for some reason, they don't drop it);
  • Those executors never get killed off by the driver, so I imagine they keep sending their heartbeats just fine.

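For reference, a minimal sketch of the kind of configuration involved (the app name and executor/instance counts are illustrative, not my exact settings): with spark.executor.cores set to 1, each executor runs a single task thread, which is the configuration where the problem never shows up.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative settings only; with spark.executor.cores = 1 each executor
// runs one task at a time and the hang does not occur.
val spark = SparkSession.builder()
  .appName("ml-pipeline")                  // hypothetical app name
  .config("spark.executor.cores", "3")     // 3 task threads per executor JVM
  .config("spark.executor.instances", "4") // illustrative executor count
  .config("spark.speculation", "true")     // enables the speculative retries observed
  .getOrCreate()
```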
Any ideas as to what may be causing this or what I should try?

asked Dec 11 '22 by ktdrv


1 Answer

TL;DR: make sure your code is thread-safe and free of race conditions before you blame Spark.

Figured it out. For posterity: I was using a thread-unsafe data structure (a mutable HashMap). Since all the tasks running on one executor share that executor's JVM, this was resulting in data races that locked up the separate task threads.
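For illustration, a minimal sketch of how this kind of race can arise (the names FeatureCache, lookupOrCompute and expensiveCompute are hypothetical, not my actual code): an object-level mutable HashMap instantiated once per executor JVM and hit concurrently by every task thread on that executor.

```scala
import scala.collection.mutable

// One instance per executor JVM; every task thread on that executor
// reads and writes it concurrently.
object FeatureCache {
  val cache: mutable.HashMap[String, Double] = mutable.HashMap.empty
}

// Called from inside a transformation, e.g. rdd.map { row => ... }.
// Unsynchronized read-then-write: with spark.executor.cores > 1, two task
// threads can mutate/resize the map at the same time, corrupting its
// internal state and potentially leaving a thread spinning forever.
def lookupOrCompute(key: String): Double =
  FeatureCache.cache.getOrElseUpdate(key, expensiveCompute(key))

def expensiveCompute(key: String): Double = key.length.toDouble // placeholder
```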

The upshot: when you have spark.executor.cores > 1 (and you probably should), make sure your code is thread-safe.
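As a sketch of one possible fix (again with hypothetical names), you could swap the mutable HashMap for a concurrent map from the Scala standard library, or simply avoid sharing mutable state across tasks altogether:

```scala
import scala.collection.concurrent.TrieMap

// TrieMap is a lock-free concurrent map; it is safe to update from
// multiple task threads running in the same executor JVM.
object FeatureCache {
  val cache: TrieMap[String, Double] = TrieMap.empty
}

def lookupOrCompute(key: String): Double =
  FeatureCache.cache.getOrElseUpdate(key, expensiveCompute(key))

def expensiveCompute(key: String): Double = key.length.toDouble // placeholder
```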

answered May 26 '23 by ktdrv