Overly-aggressive garbage collection dominating CPU

Question

I have reviewed other questions with similar titles to mine, but none of them seem to cover a situation quite like what I am expericiencing. My application starts normally, garbage collecting in a routine and expected way.

2018-05-21T20:08:41.136-0400: 19979.368: [GC (Allocation Failure) [PSYoungGen: 71364K->10997K(73728K)] 303964K->243661K(466944K), 0.0165899 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
2018-05-21T20:09:01.212-0400: 19999.444: [GC (Allocation Failure) [PSYoungGen: 71413K->11065K(73728K)] 304077K->243865K(466944K), 0.0121248 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 
2018-05-21T20:09:30.450-0400: 20028.682: [GC (Allocation Failure) [PSYoungGen: 71481K->12550K(73728K)] 304281K->245422K(466944K), 0.0133476 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 
2018-05-21T20:09:50.492-0400: 20048.723: [GC (Allocation Failure) [PSYoungGen: 72966K->10454K(73728K)] 305838K->243374K(466944K), 0.0141533 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]

After a seemingly arbitrary amount of time, the garbage collection becomes extremely aggressive, running several times per second and consuming the entire runtime of the CPU. It remains in this state until the application is restarted.

2018-05-21T20:10:12.104-0400: 20070.335: [GC (Allocation Failure) [PSYoungGen: 70870K->10356K(73728K)] 303790K->243340K(466944K), 0.0193899 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 
2018-05-21T20:10:12.222-0400: 20070.453: [GC (Allocation Failure) [PSYoungGen: 70772K->2080K(72704K)] 303756K->235288K(465920K), 0.0090667 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2018-05-21T20:10:12.413-0400: 20070.645: [GC (Allocation Failure) [PSYoungGen: 61472K->1936K(73728K)] 294680K->235256K(466944K), 0.0081242 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2018-05-21T20:10:12.519-0400: 20070.751: [GC (Allocation Failure) [PSYoungGen: 61328K->1585K(81408K)] 294648K->235248K(474624K), 0.0053709 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2018-05-21T20:10:12.582-0400: 20070.813: [GC (Allocation Failure) [PSYoungGen: 67633K->1313K(82432K)] 301296K->235240K(475648K), 0.0080559 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2018-05-21T20:10:12.647-0400: 20070.878: [GC (Allocation Failure) [PSYoungGen: 67361K->1121K(92160K)] 301288K->235264K(485376K), 0.0052482 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2018-05-21T20:10:12.718-0400: 20070.950: [GC (Allocation Failure) [PSYoungGen: 76897K->801K(92672K)] 311040K->235256K(485888K), 0.0071820 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2018-05-21T20:10:12.792-0400: 20071.024: [GC (Allocation Failure) [PSYoungGen: 76577K->641K(105472K)] 311032K->235232K(498688K), 0.0070387 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
2018-05-21T20:10:12.878-0400: 20071.109: [GC (Allocation Failure) [PSYoungGen: 89217K->32K(105472K)] 323808K->235249K(498688K), 0.0084592 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2018-05-21T20:10:12.962-0400: 20071.194: [GC (Allocation Failure) [PSYoungGen: 88608K->64K(119296K)] 323825K->235289K(512512K), 0.0066050 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]

While the CPU is thus engaged, my application doesn't work properly. The server has two CPUs, but the JVM only appears to use one of them, with pidstat reporting average usage of just over 100%.

What puzzles me most about this behavior is that the young generation is driven down to a small fraction of the space allocated to it and kept there, never getting to above 1 Mb after garbage collection.

How can I prevent this from happening, or what can I do to diagnose why this is happening? I'm not expert in tuning garbage collection, so I could use some veteran guidance.

My current configuration for the JVM is:

-XX:InitialHeapSize=268435456 
-XX:MaxHeapSize=2147483648 
-XX:+PrintGC 
-XX:+PrintGCDateStamps 
-XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps 
-XX:+UseCompressedClassPointers 
-XX:+UseCompressedOops 
-XX:+UseParallelGC

Stephen C · Accepted Answer

You are describing the symptoms of a garbage collection "death spiral".

Basically, if you have a heap where the space occupied by reachable objects is continually trending upwards, and a larger and larger percentage of the time is spent running the garbage collector. Eventually, either the heap will fill up completely, or the GC Overhead Limit will be breached. In either case an OOME will be thrown.

There are basically three approaches:

Periodically restart the application.
Increase the heap size.
Figure out why your application is using more and more heap space. Typically there is a storage leak of some kind.

Only the 3rd approach really solves the problem The others are "band-aid" solutions.

While the CPU is thus engaged, my application doesn't work properly. The server has two CPUs, but the JVM only appears to use one of them, with pidstat reporting average usage of just over 100%.

If the GC is under too much pressure (e.g. due to an "almost full" heap) then you are likely to find that it has to fall back to non-ergonomic modes of operation. For example, it may have decided to create only one background GC thread at JVM startup, that's fine normally but under extreme GC load that thread goes to 100%, and your application threads are then blocked.

Basically, the GC is not designed to work well when the heap is too small for the application's needs.

Overly-aggressive garbage collection dominating CPU

Tags:

java

garbage-collection

configuration

server

cpu-usage

Jeremy

1 Answers

Stephen C

Recent Activity

Donate For Us

Overly-aggressive garbage collection dominating CPU

Tags:

java

garbage-collection

configuration

server

cpu-usage

Jeremy

1 Answers

Stephen C

Related questions

Recent Activity

Donate For Us