We run a Spark Streaming job on AWS EMR. The job runs stably for anywhere between 10 and 14 hours, then crashes with no discernible errors in stderr, stdout, or CloudWatch logs. After this crash, any attempt to restart the job immediately fails with "'Cannot allocate memory' (errno=12)" (full message).
Investigation with both CloudWatch metrics and Ganglia shows that driver.jvm.heap.used is growing steadily over time.
Both of these observations led me to believe that some long-running component of Spark (i.e. above the job level) was failing to free memory correctly. This is supported by the fact that restarting the hadoop-yarn-resourcemanager (as per here) causes heap usage to drop back to "fresh cluster" levels.
If that assumption is correct, what would cause YARN to keep consuming more and more memory? (If not, how could I falsify it?)
Some relevant configuration notes:

* spark.streaming.unpersist defaults to true (although I've tried adding a manual rdd.unpersist() at the end of my job anyway, just to check whether that has any effect; it hasn't been running long enough to tell definitively yet). A sketch of that experiment is shown after this list.
* The documentation for spark.yarn.am.extraJavaOptions suggests that, when running in yarn-client mode (which we are), spark.yarn.am.memory sets the maximum YARN Application Master heap memory usage. This value is not overridden in our job (so it should be at the default of 512m), yet both CloudWatch and Ganglia clearly show driver heap usage in the gigabytes.
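For reference, here is a minimal sketch of that manual-unpersist experiment. The socket source, batch interval, and per-batch processing are placeholders rather than our actual job:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ManualUnpersistSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("manual-unpersist-sketch")
    // spark.streaming.unpersist already defaults to true; set explicitly
    // here only to make the experiment's assumptions visible.
    conf.set("spark.streaming.unpersist", "true")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder source: a socket stream stands in for the real input.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // ... the real per-batch processing would go here ...
      rdd.count()

      // Explicitly release the batch's RDD after processing, in addition
      // to Spark Streaming's automatic cleanup.
      rdd.unpersist(blocking = false)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```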
It turns out that the default SparkUI retention values were much larger than our system could handle. After setting them down to 1/20th of the default values, the system has been running stably for 24 hours with no increase in heap usage over that time.

For clarity, the values that were edited were:
* spark.ui.retainedJobs=50
* spark.ui.retainedStages=50
* spark.ui.retainedTasks=500
* spark.worker.ui.retainedExecutors=50
* spark.worker.ui.retainedDrivers=50
* spark.sql.ui.retainedExecutions=50
* spark.streaming.ui.retainedBatches=50
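For illustration, one way these overrides might be applied from application code is sketched below (the object name is mine; the same key/value pairs can equally be set in spark-defaults.conf or passed as --conf flags to spark-submit):

```scala
import org.apache.spark.SparkConf

object TrimmedUiRetentionConf {
  // Reduced UI-retention settings matching the list above; these limit how
  // much UI metadata Spark keeps around between batches.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("streaming-job-with-trimmed-ui-retention")
      .set("spark.ui.retainedJobs", "50")
      .set("spark.ui.retainedStages", "50")
      .set("spark.ui.retainedTasks", "500")
      .set("spark.worker.ui.retainedExecutors", "50")
      .set("spark.worker.ui.retainedDrivers", "50")
      .set("spark.sql.ui.retainedExecutions", "50")
      .set("spark.streaming.ui.retainedBatches", "50")
}
```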