Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Problems with Java garbage collector and memory

I am having a really weird issue with a Java application.

Essentially it is a web page that uses magnolia (a cms system), there are 4 instances available on production environment. Sometimes the CPU goes to 100% in a java process.

So, first approach was to make a thread dump, and check the offending thread, what I found was weird:

"GC task thread#0 (ParallelGC)" prio=10 tid=0x000000000ce37800 nid=0x7dcb runnable 
"GC task thread#1 (ParallelGC)" prio=10 tid=0x000000000ce39000 nid=0x7dcc runnable 

Ok, that is pretty weird, I have never had a problem with the garbage collector like that, so the next thing we did was to activate JMX and using jvisualvm inspect the machine: the heap memory usage was really high (95%).

Naive approach: Increase memory, so the problem takes more time to appear, result, on the restarted server with increased memory (6 GB!) the problem appeared 20 hours after restart while on other servers with less memory (4GB!) that had been running for 10 days, the problem took still a few more days to reappear. Also, I tried to use the apache access log from the server failing and use JMeter to replay the requests into a local server in an attemp to reproduce the error... it did not work either.

Then I investigated the logs a little bit more to find this errors

info.magnolia.module.data.importer.ImportException: Error while importing with handler [brightcoveplaylist]:GC overhead limit exceeded
at info.magnolia.module.data.importer.ImportHandler.execute(ImportHandler.java:464)
at info.magnolia.module.data.commands.ImportCommand.execute(ImportCommand.java:83)
at info.magnolia.commands.MgnlCommand.executePooledOrSynchronized(MgnlCommand.java:174)
at info.magnolia.commands.MgnlCommand.execute(MgnlCommand.java:161)
at info.magnolia.module.scheduler.CommandJob.execute(CommandJob.java:91)
at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
    Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

Another example

    Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Arrays.java:2894)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:407)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at java.lang.StackTraceElement.toString(StackTraceElement.java:175)
    at java.lang.String.valueOf(String.java:2838)
    at java.lang.StringBuilder.append(StringBuilder.java:132)
    at java.lang.Throwable.printStackTrace(Throwable.java:529)
    at org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:60)
    at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
    at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
    at org.apache.log4j.AsyncAppender.append(AsyncAppender.java:162)
    at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
    at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
    at org.apache.log4j.Category.callAppenders(Category.java:206)
    at org.apache.log4j.Category.forcedLog(Category.java:391)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.slf4j.impl.Log4jLoggerAdapter.error(Log4jLoggerAdapter.java:576)
    at info.magnolia.module.templatingkit.functions.STKTemplatingFunctions.getReferencedContent(STKTemplatingFunctions.java:417)
    at info.magnolia.module.templatingkit.templates.components.InternalLinkModel.getLinkNode(InternalLinkModel.java:90)
    at info.magnolia.module.templatingkit.templates.components.InternalLinkModel.getLink(InternalLinkModel.java:66)
    at sun.reflect.GeneratedMethodAccessor174.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:622)
    at freemarker.ext.beans.BeansWrapper.invokeMethod(BeansWrapper.java:866)
    at freemarker.ext.beans.BeanModel.invokeThroughDescriptor(BeanModel.java:277)
    at freemarker.ext.beans.BeanModel.get(BeanModel.java:184)
    at freemarker.core.Dot._getAsTemplateModel(Dot.java:76)
    at freemarker.core.Expression.getAsTemplateModel(Expression.java:89)
    at freemarker.core.BuiltIn$existsBI._getAsTemplateModel(BuiltIn.java:709)
    at freemarker.core.BuiltIn$existsBI.isTrue(BuiltIn.java:720)
    at freemarker.core.OrExpression.isTrue(OrExpression.java:68)

Then I find out that such problem is due to the garbage collector using a ton of CPU but not able to free much memory

Ok, so it is a problem with the MEMORY that manifests itself in the CPU, so If the memory usage problem is solved, then the CPU should be fine, so I took a heapdump, unfortunatelly it was just too big to open it (the file was 10GB), anyway I run the server locallym loaded it a little bit and took a heapdump, after opening it, I found something interesting:

There are a TON of instances of

AbstractReferenceMap$WeakRef  ==> Takes 21.6% of the memory, 9 million instances
AbstractReferenceMap$ReferenceEntry  ==> Takes 9.6% of the memory, 3 million instances

In addition, I have found a Map which seems to be used as a "cache" (horrible but true), the problem is that such map is NOT synchronized and it is shared among threads (being static), the problem could be not only concurrent writes but also the fact that with lack of synchronization, there is no guarantee that thread A will see the changes done to the map by thread B, however, I am unable to figure out how to link this suspicious map using the memory eclipse analyzer, as it does not use the AbstracReferenceMap, it is just a normal HashMap.

Unfortunately, we do not use those classes directly (obviously the code uses them, but not directly), so I have seem to hit a dead end.

Problems for me are

  1. I cannot reproduce the error
  2. I cannot figure out where the hell the memory is leaking (if that is the case)

Any ideas at all?

like image 227
Juan Antonio Gomez Moriano Avatar asked Feb 27 '14 22:02

Juan Antonio Gomez Moriano


2 Answers

The 'no-op' finalize() methods should definitely be removed as they are likely to make any GC performance problems worse. But I suspect that you have other memory leak issues as well.

Advice:

  • First get rid of the useless finalize() methods.

  • If you have other finalize() methods, consider getting rid of them. (Depending on finalization to do things is generally a bad idea ...)

  • Use memory profiler to try to identify the objects that are being leaked, and what is causing the leakage. There are lots of SO Questions ... and other resources on finding leaks in Java code. For example:

    • How to find a Java Memory Leak
    • Troubleshooting Guide for Java SE 6 with HotSpot VM, Chapter 3.

Now to your particular symptoms.

First of all, the place where the OutOfMemoryErrors were thrown is probably irrelevant.

However, the fact that you have huge numbers of AbstractReferenceMap$WeakRef and AbstractReferenceMap$ReferenceEntry objects is a string indication that something in your application or the libraries it is using is doing a huge amount of caching ... and that that caching is implicated in the problem. (The AbstractReferenceMap class is part of the Apache Commons Collections library. It is the superclass of ReferenceMap and ReferenceIdentityMap.)

You need to track down the map object (or objects) that those WeakRef and ReferenceEntry objects belong to, and the (target) objects that they refer to. Then you need to figure out what is creating it / them and figure out why the entries are not being cleared in response to the high memory demand.

  • Do you have strong references to the target objects elsewhere (which would stop the WeakRefs from being broken)?

  • Is / are the map(s) being used incorrectly so as to cause a leak. (Read the javadocs carefully ...)

  • Are the maps being used by multiple threads without external synchronization? That could result in corruption, which potentially could manifest as a massive storage leak.


Unfortunately, these are only theories and there could be other things causing this. And indeed, it is conceivable that this is not a memory leak at all.


Finally, your observation that the problem is worse when the heap is bigger. To me, this is still consistent with a Reference / cache-related issue.

  • Reference objects are more work for the GC than regular references.

  • When the GC needs to "break" a Reference, that creates more work; e.g. processing the Reference queues.

  • Even when that happens, the resulting unreachable objects still can't be collected until the next GC cycle at the earliest.

So I can see how a 6Gb heap full of References would significantly increase the percentage of time spent in the GC ... compared to a 4Gb heap, and that could cause the "GC Overhead Limit" mechanism to kick in earlier.

But I reckon that this is an incidental symptom rather than the root cause.

like image 191
Stephen C Avatar answered Oct 11 '22 12:10

Stephen C


With a difficult debugging problem, you need to find a way to reproduce it. Only then will you be able to test experimental changes and determine if they make the problem better or worse. In this case, I'd try writing loops that rapidly create & delete server connections, that create a server connection and rapidly send it memory-expensive requests, etc.

After you can reproduce it, try reducing the heap size to see if you can reproduce it faster. But do that second since a small heap might not hit the "GC overhead limit" which means the GC is spending excessive time (98% by some measure) trying to recover memory.

For a memory leak, you need to figure out where in the code it's accumulating references to objects. E.g. does it build a Map of all incoming network requests? A web search https://www.google.com/search?q=how+to+debug+java+memory+leaks shows many helpful articles on how to debug Java memory leaks, including tips on using tools like the Eclipse Memory Analyzer that you're using. A search for the specific error message https://www.google.com/search?q=GC+overhead+limit+exceeded is also helpful.

The no-op finalize() methods shouldn't cause this problem but they may well exacerbate it. The doc on finalize() reveals that having a finalize() method forces the GC to twice determine that the instance is unreferenced (before and after calling finalize()).

So once you can reproduce the problem, try deleting those no-op finalize() methods and see if the problem takes longer to reproduce.

It's significant that there are many AbstractReferenceMap$WeakRef instances in memory. The point of a weak reference is to refer to an object without forcing it to stay in memory. AbstractReferenceMap is a Map that lets one make the keys and/or values be weak references or soft references. (The point of a soft reference is to try to keep an object in memory but let the GC free it when memory gets low.) Anyway, all those WeakRef instances in memory are probably exacerbating the problem but shouldn't keep the referenced Map keys/values in memory. What are they referring to? What else is referring to those objects?

like image 26
Jerry101 Avatar answered Oct 11 '22 14:10

Jerry101