 

Java - GC running but not reclaiming anything

Over the last couple of days, we have been seeing the JVMs on our servers get into a state where they spend 100% of their CPU time doing OldGen GC, when:

A. They don't need to because there is plenty of space left on the heap and

B. They are not reclaiming anything.

I know they are in GC because I took a stack trace and correlated the thread IDs in Process Explorer with those in the stack dump. Each GC thread is occupying around 4% CPU.

The servers run 16 GB heaps (32 GB physical RAM) and have 8 cores. Uptime is usually around 30 days, with reboots required only for MS patching, but currently they are crashing at the 20-day mark.

Here is a graph of GC duration (time scale = 19 days): http://i45.tinypic.com/257qalu.png

Here is a zoom of the tail of that graph http://i48.tinypic.com/2duiccw.png

As you can see, duration increases dramatically.

Here is a graph of the heap usage after GC. http://i48.tinypic.com/znna4h.png

If it were a typical memory leak, I would expect the orange peaks to climb higher and higher until they couldn't climb any further, but as this graph shows, there is plenty of heap space left.

I've got heap dumps for each server and nothing stands out as problematic: there are a few ehCache stores, I can see our application code, i.e. just "normal stuff".

The biggest change we made ~20 days ago was to apply a vendor patch that changed an internal cache from an unbounded HashMap holding hard references (an obvious memory leak) to one built on soft references, and I am wondering if this is the cause, i.e. is there somehow a massive overhead in managing these soft references after a certain point?
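
For illustration, a soft-reference cache of that kind presumably looks something like the sketch below (this is not the vendor's actual code; the class and method names are invented). The point is that the GC may clear the cached values under memory pressure, but the SoftReference wrappers and their map entries still have to be discovered and processed by the collector:

import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only - not the vendor's implementation.
public class SoftRefCache<K, V> {
    private final Map<K, SoftReference<V>> map =
            new ConcurrentHashMap<K, SoftReference<V>>();

    public void put(K key, V value) {
        // The referent can be cleared by the GC under memory pressure,
        // but the SoftReference object and the map entry remain.
        map.put(key, new SoftReference<V>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (ref != null && value == null) {
            map.remove(key); // referent was cleared by the GC
        }
        return value;
    }
}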

Does anyone have any ideas as to where to look next, or can someone confirm my soft reference theory?

Here is my jvm.args:

java.args=-server -Xms16000m -Xmx16000m -Dsun.io.useCanonCaches=false -XX:MaxPermSize=350m -Xloggc:e:/gcLogs/eRGCLogs.txt -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseParallelGC -XX:+UseParallelOldGC -Dnet.sf.ehcache.sizeof.filter=D:/jo3/java_ehCacheOpenSource/sizeOfExclusions.config -Xbatch -Dcoldfusion.rootDir={application.home}/../ -Dcoldfusion.libPath={application.home}/../lib -Dcoldfusion.classPath={application.home}/../lib/updates,{application.home}/../lib,{application.home}/../gateway/lib/,{application.home}/../wwwroot/WEB-INF/flex/jars,{application.home}/../wwwroot/WEB-INF/cfform/jars,d:/jo3/java,d:/JO3/java_ehCacheOpenSource/,D:/jo3/java_ehCacheMonitorProbe

We are on ColdFusion, which is essentially a large framework that sits on top of Java.

JVM version: 1.6.0_29

As requested, "normal" GC logs look like:

2013-03-19T22:11:36.670+1100: 1288665.702: [GC [PSYoungGen: 4695800K->471119K(4722112K)] 9301727K->5077046K(15644800K), 0.3584434 secs] [Times: user=5.01 sys=0.00, real=0.36 secs]
2013-03-19T22:14:55.078+1100: 1288864.099: [GC [PSYoungGen: 4722063K->498009K(4783104K)] 9327990K->5103936K(15705792K), 0.3766889 secs] [Times: user=5.37 sys=0.00, real=0.38 secs]
2013-03-19T22:17:46.749+1100: 1289035.760: [GC [PSYoungGen: 4654489K->517299K(4673792K)] 9260416K->5123227K(15596480K), 0.4130828 secs] [Times: user=5.80 sys=0.00, real=0.41 secs]
2013-03-19T22:21:08.762+1100: 1289237.763: [GC [PSYoungGen: 4673779K->522660K(4738880K)] 9279707K->5143831K(15661568K), 0.4005516 secs] [Times: user=5.97 sys=0.00, real=0.40 secs]
2013-03-19T22:23:42.683+1100: 1289391.675: [GC [PSYoungGen: 4582628K->530998K(4590976K)] 9203799K->5186242K(15513664K), 0.4317352 secs] [Times: user=6.24 sys=0.00, real=0.43 secs]
2013-03-19T22:26:11.096+1100: 1289540.080: [GC [PSYoungGen: 4590966K->518331K(4724096K)] 9246210K->5206959K(15646784K), 0.3914401 secs] [Times: user=5.99 sys=0.00, real=0.39 secs]
2013-03-19T22:27:44.076+1100: 1289633.055: [GC [PSYoungGen: 2602730K->447527K(4732864K)] 7291358K->5208743K(15655552K), 0.3725317 secs] [Times: user=5.80 sys=0.00, real=0.37 secs]
2013-03-19T22:27:44.448+1100: 1289633.428: [Full GC (System) [PSYoungGen: 447527K->0K(4732864K)] [ParOldGen: 4761215K->4628296K(10922688K)] 5208743K->4628296K(15655552K) [PSPermGen: 352378K->352287K(352832K)], 4.2955639 secs] [Times: user=57.70 sys=0.06, real=4.30 secs]
2013-03-19T22:30:37.950+1100: 1289806.920: [GC [PSYoungGen: 4004416K->70948K(4690432K)] 8632712K->4699245K(15613120K), 0.1062227 secs] [Times: user=0.76 sys=0.00, real=0.11 secs]
2013-03-19T22:33:27.154+1100: 1289976.115: [GC [PSYoungGen: 4054116K->109175K(4092352K)] 8682413K->4737472K(15015040K), 0.1347919 secs] [Times: user=1.03 sys=0.00, real=0.13 secs]
2013-03-19T22:36:32.120+1100: 1290161.070: [GC [PSYoungGen: 4092343K->147318K(4712320K)] 8720640K->4775615K(15635008K), 0.1593523 secs] [Times: user=1.58 sys=0.00, real=0.16 secs]

When we are in failure mode, the GC logs look like:

2013-03-22T10:03:47.619+1100: 1504185.901: [GC [PSYoungGen: 0K->0K(5452736K)] 4413907K->4413907K(16375424K), 0.0114248 secs] [Times: user=0.16 sys=0.00, real=0.01 secs]
2013-03-22T10:03:47.631+1100: 1504185.912: [Full GC [PSYoungGen: 0K->0K(5452736K)] [ParOldGen: 4413907K->4412613K(10922688K)] 4413907K->4412613K(16375424K) [PSPermGen: 358399K->358278K(358400K)], 5.4435442 secs] [Times: user=73.74 sys=0.14, real=5.44 secs]
2013-03-22T10:03:53.145+1100: 1504191.426: [GC [PSYoungGen: 269219K->7734K(5449088K)] 4681833K->4422114K(16371776K), 0.0298728 secs] [Times: user=0.34 sys=0.00, real=0.03 secs]
2013-03-22T10:03:53.175+1100: 1504191.456: [Full GC [PSYoungGen: 7734K->0K(5449088K)] [ParOldGen: 4414379K->4415189K(10922688K)] 4422114K->4415189K(16371776K) [PSPermGen: 358399K->358371K(358400K)], 2.6033684 secs] [Times: user=36.33 sys=0.00, real=2.60 secs]
2013-03-22T10:03:55.788+1100: 1504194.069: [GC [PSYoungGen: 94969K->826K(5451328K)] 4510158K->4416015K(16374016K), 0.0133588 secs] [Times: user=0.16 sys=0.00, real=0.01 secs]
2013-03-22T10:03:55.802+1100: 1504194.082: [Full GC [PSYoungGen: 826K->0K(5451328K)] [ParOldGen: 4415189K->4415348K(10922688K)] 4416015K->4415348K(16374016K) [PSPermGen: 358399K->358389K(358400K)], 2.7156884 secs] [Times: user=38.11 sys=0.00, real=2.71 secs]

Asked Mar 22 '13 by Dave


1 Answer

As many have mentioned in the comments, running out of space in the PermGen is very likely your cause. This can be caused by excessive interning of Strings throughout the code, which can make the PermGen "explode"; loading a lot of classes (usually via frameworks that generate them for you in the background) can also cause this.

Also, as one of the comments mentioned, using the CMS collector (Concurrent Mark Sweep) can reduce your stop-the-world GC pauses, assuming the problem is with the capacity of the old generation. It can also improve your performance by reducing latency, which is good regardless of the current problem.
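
As a rough sketch only (the exact flags depend on your setup and should be tested before production), switching away from the parallel collector would mean replacing -XX:+UseParallelGC -XX:+UseParallelOldGC in the java.args above with something like:

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled

-XX:+CMSClassUnloadingEnabled lets classes be unloaded during CMS cycles, which matters if PermGen pressure turns out to be the real problem.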

Also, posting fragments of the GC log can help point you in the right direction.

About the jstat tool, you can use it in the following manner to get useful information:

jstat -gcutil <pid> <interval> 

I usually use a 1000 ms interval. -gcutil gives you GC utilization (in %), so you can see if any generation is approaching 100%.

You can also use jstat -gc <pid> ... and get the exact capacity of your Old generation.
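
For example, assuming the JVM's pid is 12345 (substitute your own), sampling once per second:

jstat -gcutil 12345 1000

In the -gcutil output the P column shows PermGen utilization as a percentage of its current capacity; if it stays pinned near 100 even right after full GCs, that points at the PermGen rather than the heap.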


EDIT: After reviewing GC Logs

Your GC logs confirm the original premise that your PermGen is filling up. In the time range between 10:03:47 and 10:03:55 I can see that the PermGen constantly reaches its maximum and then each GC somehow removes only around 10-100 KB of data:

See the PSPermGen figures below (the numbers read used-before -> used-after (capacity)):

2013-03-22T10:03:47.631+1100: 1504185.912: [Full GC [... [PSPermGen: 358399K->358278K(358400K)]...
2013-03-22T10:03:53.175+1100: 1504191.456: [Full GC [... [PSPermGen: 358399K->358371K(358400K)]...
2013-03-22T10:03:55.802+1100: 1504194.082: [Full GC [... [PSPermGen: 358399K->358389K(358400K)]...

If you look at the old and young generations you can see that neither is reaching its maximum; the OldGen is consuming about 4 GB out of 10 GB, so it is not the cause.

From the data you gathered I cannot tell whether the rate at which the PermGen fills up is consistent with the uptime of the process: it looks as though the PermGen should fill up in a day, not 20 days. So it is hard to say what the definitive solution is, but here are some recommendations:

  1. Examine your code to ensure you are not abusing String's intern() method; if you use it widely for no good reason, it can be your root cause.
  2. Check whether the frameworks you are using generate classes dynamically; this consumes PermGen space as well, though usually only up to a point.
  3. If you can, restart your processes weekly to prevent this downtime.
  4. Consider increasing the PermGen space (see the example flag after this list), but keep monitoring it, since increasing it might just stretch out your 20-day period without solving the problem. The PermGen should remain fairly static once the process has been running long enough.
  5. A Google search for the string Coldfusion PermGen produces many hits that report similar issues; following those may help focus your investigation.
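
For point 4, an illustrative change (the value is only an example, not a tuned recommendation) would be to raise the current -XX:MaxPermSize=350m in your java.args to something like:

-XX:MaxPermSize=512m

and then keep watching the P column in jstat -gcutil to see whether PermGen usage merely climbs more slowly or actually levels off.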
Answered by RonK