YARN heap usage growing over time

We run a Spark Streaming job on AWS EMR. The job runs stably for anywhere between 10 and 14 hours, then crashes with no discernible errors in stderr, stdout, or the CloudWatch logs. After this crash, any attempt to restart the job immediately fails with "'Cannot allocate memory' (errno=12)" (full message).

Investigation with both CloudWatch metrics and Ganglia shows that driver.jvm.heap.used grows steadily over time.

Both of these observations lead me to believe that some long-running component of Spark (i.e. one above the job level) is failing to free memory correctly. This is supported by the fact that restarting the hadoop-yarn-resourcemanager (as described here) causes heap usage to drop back to "fresh cluster" levels.
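
For reference, a minimal sketch of that restart on the EMR master node; the exact command depends on the EMR release (systemd on Amazon Linux 2-based releases, upstart on older ones):

    # Amazon Linux 2-based EMR releases (systemd)
    sudo systemctl restart hadoop-yarn-resourcemanager

    # Older EMR releases (upstart)
    sudo stop hadoop-yarn-resourcemanager
    sudo start hadoop-yarn-resourcemanager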

If that assumption is correct - what would cause YARN to keep consuming more and more memory? (And if it is not - how could I falsify it?)

  • I see from here that spark.streaming.unpersist defaults to true (although I've tried adding a manual rdd.unpersist() at the end of my job anyway, just to check whether that has any effect - it hasn't been running long enough to tell definitively yet; see the sketch after this list)
  • Here, the comment on spark.yarn.am.extraJavaOptions suggests that, when running in yarn-client mode (which we are), spark.yarn.am.memory sets the maximum heap usage of the YARN Application Master. This value is not overridden in our job (so it should be at the default of 512m), yet both CloudWatch and Ganglia clearly show driver heap usage in the gigabytes.
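
For the first bullet, here is a minimal sketch of the manual per-batch unpersist, assuming a simple hypothetical job structure (the source, transformations, app name, and batch interval are placeholders, not our actual job):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-job") // hypothetical app name
      .set("spark.streaming.unpersist", "true") // the default, shown for clarity
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    lines.foreachRDD { rdd =>
      val counts = rdd.map(line => (line, 1L)).reduceByKey(_ + _)
      counts.cache()                 // persisted so the two actions below reuse it
      println(counts.count())        // hypothetical action #1
      counts.take(5).foreach(println) // hypothetical action #2
      counts.unpersist()             // explicit release at the end of each batch
    }

    ssc.start()
    ssc.awaitTermination()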
1 Answer

It turns out that the default Spark UI retention values here were much larger than our system could handle. After reducing them to roughly 1/20th of their defaults, the system has been running stably for 24 hours with no increase in heap usage over that time.

For clarity, the values that were edited were:

* spark.ui.retainedJobs=50
* spark.ui.retainedStages=50
* spark.ui.retainedTasks=500
* spark.worker.ui.retainedExecutors=50
* spark.worker.ui.retainedDrivers=50
* spark.sql.ui.retainedExecutions=50
* spark.streaming.ui.retainedBatches=50
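
A minimal sketch of applying these settings when constructing the SparkConf; equivalently, they can be passed as --conf flags to spark-submit or set in spark-defaults.conf:

    import org.apache.spark.SparkConf

    // Cap Spark UI retention so a long-running streaming driver doesn't
    // accumulate unbounded UI metadata on the heap
    val conf = new SparkConf()
      .set("spark.ui.retainedJobs", "50")
      .set("spark.ui.retainedStages", "50")
      .set("spark.ui.retainedTasks", "500")
      .set("spark.worker.ui.retainedExecutors", "50")
      .set("spark.worker.ui.retainedDrivers", "50")
      .set("spark.sql.ui.retainedExecutions", "50")
      .set("spark.streaming.ui.retainedBatches", "50")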