Debugging "Managed memory leak detected" in Spark 1.6.0

Tags:

apache-spark

I've tried upgrading to Apache Spark 1.6.0 RC3. My application now spams these errors for nearly every task:

Managed memory leak detected; size = 15735058 bytes, TID = 830

I've set the logging level for org.apache.spark.memory.TaskMemoryManager to DEBUG.
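For reference, a line like this in conf/log4j.properties does it (assuming the stock log4j 1.2 setup that Spark 1.6 ships with):

    log4j.logger.org.apache.spark.memory.TaskMemoryManager=DEBUG

With that in place, I see this in the logs: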

I2015-12-18 16:54:41,125 TaskSetManager: Starting task 0.0 in stage 7.0 (TID 6, localhost, partition 0,NODE_LOCAL, 3026 bytes)
I2015-12-18 16:54:41,125 Executor: Running task 0.0 in stage 7.0 (TID 6)
I2015-12-18 16:54:41,130 ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
I2015-12-18 16:54:41,130 ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
D2015-12-18 16:54:41,188 TaskMemoryManager: Task 6 acquire 5.0 MB for null
I2015-12-18 16:54:41,199 ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
I2015-12-18 16:54:41,199 ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
D2015-12-18 16:54:41,262 TaskMemoryManager: Task 6 acquire 5.0 MB for null
D2015-12-18 16:54:41,397 TaskMemoryManager: Task 6 release 5.0 MB from null
E2015-12-18 16:54:41,398 Executor: Managed memory leak detected; size = 5245464 bytes, TID = 6

How do you debug these errors? Is there a way to log stack traces for allocations and deallocations, so I can find what leaks?

I don't know much about the new unified memory manager (SPARK-10000). Is the leak likely my fault or is it likely a Spark bug?

asked Dec 18 '15 by Daniel Darabos


2 Answers

The short answer is that users are not supposed to see this message. Users are not supposed to be able to create memory leaks in the unified memory manager.

That such leaks happen at all is a Spark bug: SPARK-11293.


But if you want to track down the cause of a memory leak, this is how I did it:

  1. Download the Spark source code and make sure you can build it and that your build works.
  2. In TaskMemoryManager.java add extra logging in acquireExecutionMemory and releaseExecutionMemory: logger.error("stack trace:", new Exception());
  3. Change all the other debug logs to error in TaskMemoryManager.java. (It's easier than figuring out logging configurations...)

Now you will see the full stack trace for every allocation and deallocation. Try to match them up and find the allocations without deallocations: the stack trace of an unmatched allocation points you at the source of the leak. (A self-contained toy version of this bookkeeping is sketched below.)
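To show the idea in isolation, here is a minimal, runnable toy version of the technique. This is not Spark code, and every name in it is made up for the illustration: capture a stack trace on each acquire, discard it on the matching release, and whatever is left at the end is the leak.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Toy allocation tracker: record a stack trace per acquire, drop it on
    // release, and report whatever was never released.
    public class LeakTracker {
        private final Map<Long, Exception> liveAllocations = new ConcurrentHashMap<>();
        private final AtomicLong nextId = new AtomicLong();

        public long acquire(long bytes) {
            long id = nextId.incrementAndGet();
            // The Exception is never thrown; it only captures the call site.
            liveAllocations.put(id, new Exception("acquired " + bytes + " bytes"));
            return id;
        }

        public void release(long id) {
            liveAllocations.remove(id);
        }

        // Call this at task end: anything still tracked was never released.
        public void reportLeaks() {
            for (Exception site : liveAllocations.values()) {
                System.err.println("Leaked allocation, acquired at:");
                site.printStackTrace();
            }
        }

        public static void main(String[] args) {
            LeakTracker tracker = new LeakTracker();
            long leaked = tracker.acquire(5 * 1024 * 1024); // never released
            long matched = tracker.acquire(1024);
            tracker.release(matched);
            tracker.reportLeaks(); // prints the stack trace for 'leaked' only
        }
    }

With the logging patch above you do this bookkeeping by eye across the log output instead of in a map, but the principle is the same.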

answered Sep 24 '22 by Daniel Darabos


I ran into this warning too, but in my case it was caused by df.repartition(rePartNum, df("id")). My DataFrame was empty, and the number of warning lines equaled rePartNum. Version: Spark 2.4 on Windows 10.
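A minimal sketch that should reproduce this, written against the Java API as I understand it (rePartNum is an arbitrary partition count picked for the illustration):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class EmptyRepartitionRepro {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .master("local[*]")
                    .appName("EmptyRepartitionRepro")
                    .getOrCreate();

            int rePartNum = 8; // arbitrary partition count

            // An empty DataFrame that still has an "id" column to partition by.
            Dataset<Row> df = spark.range(0).toDF("id");

            // Repartition the empty DataFrame by column and force an action;
            // as described above, this emits one leak warning per partition.
            df.repartition(rePartNum, df.col("id")).count();

            spark.stop();
        }
    }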

answered Sep 22 '22