YARN heap usage growing over time

We run a Spark Streaming job on AWS EMR. The job runs stably for anywhere between 10 and 14 hours, then crashes with no discernible errors in stderr, stdout, or the CloudWatch logs. After this crash, any attempt to restart the job immediately fails with "'Cannot allocate memory' (errno=12)" (full message).

Investigation with both CloudWatch metrics and Ganglia shows that driver.jvm.heap.used grows steadily over time.

Both of these observations lead me to believe that some long-running component of Spark (i.e. one above the job level) is failing to free memory correctly. This is supported by the fact that restarting the hadoop-yarn-resourcemanager (as described here) causes heap usage to drop back to "fresh cluster" levels.
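
For reference, a minimal sketch of that restart on the EMR master node; the exact command depends on the EMR release (systemd on Amazon Linux 2-based releases, upstart on older ones):

    # Amazon Linux 2-based EMR releases (systemd)
    sudo systemctl restart hadoop-yarn-resourcemanager

    # Older EMR releases (upstart)
    sudo stop hadoop-yarn-resourcemanager
    sudo start hadoop-yarn-resourcemanager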

If that assumption is correct - what would cause YARN to keep consuming more and more memory? (And if it is not - how could I falsify it?)

  • I see from here that spark.streaming.unpersist defaults to true (although I've tried adding a manual rdd.unpersist() at the end of my job anyway, just to check whether that has any effect - it hasn't been running long enough to tell definitively yet; see the sketch after this list)
  • Here, the comment on spark.yarn.am.extraJavaOptions suggests that, when running in yarn-client mode (which we are), spark.yarn.am.memory sets the maximum heap usage of the YARN Application Master. This value is not overridden in our job (so it should be at the default of 512m), yet both CloudWatch and Ganglia clearly show driver heap usage in the gigabytes.
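
For the first bullet, here is a minimal sketch of the manual per-batch unpersist, assuming a simple hypothetical job structure (the source, transformations, app name, and batch interval are placeholders, not our actual job):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-job") // hypothetical app name
      .set("spark.streaming.unpersist", "true") // the default, shown for clarity
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    lines.foreachRDD { rdd =>
      val counts = rdd.map(line => (line, 1L)).reduceByKey(_ + _)
      counts.cache()                 // persisted so the two actions below reuse it
      println(counts.count())        // hypothetical action #1
      counts.take(5).foreach(println) // hypothetical action #2
      counts.unpersist()             // explicit release at the end of each batch
    }

    ssc.start()
    ssc.awaitTermination()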
1 Answer

It turns out that the default Spark UI retention values here were much larger than our system could handle. After reducing them to roughly 1/20th of their defaults, the system has been running stably for 24 hours with no increase in heap usage over that time.

For clarity, the values that were edited were:

* spark.ui.retainedJobs=50
* spark.ui.retainedStages=50
* spark.ui.retainedTasks=500
* spark.worker.ui.retainedExecutors=50
* spark.worker.ui.retainedDrivers=50
* spark.sql.ui.retainedExecutions=50
* spark.streaming.ui.retainedBatches=50
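
A minimal sketch of applying these settings when constructing the SparkConf; equivalently, they can be passed as --conf flags to spark-submit or set in spark-defaults.conf:

    import org.apache.spark.SparkConf

    // Cap Spark UI retention so a long-running streaming driver doesn't
    // accumulate unbounded UI metadata on the heap
    val conf = new SparkConf()
      .set("spark.ui.retainedJobs", "50")
      .set("spark.ui.retainedStages", "50")
      .set("spark.ui.retainedTasks", "500")
      .set("spark.worker.ui.retainedExecutors", "50")
      .set("spark.worker.ui.retainedDrivers", "50")
      .set("spark.sql.ui.retainedExecutions", "50")
      .set("spark.streaming.ui.retainedBatches", "50")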