I'm not a Java newby but I know only a tiny bit about garbage collection. Now I would like to change that with some practical experiences. My goal is a latency of under 0.3 seconds, or in extreme cases 0.5 is okay too.
I have an application with -Xmx50gb (-Xms50gb) and set the other following GC options:
-XX:+UseG1GC -Xloggc:somewhere.gc.log -XX:+PrintGCDateStamps
But now I'm occasionally having long pauses over 5sec due to garbage collections and that although there seems to be enough memory available. One reason I found:
[GC pause (G1 Evacuation Pause) (young) 42G->40G(48G), 5.9409662 secs]
Why is GCG1 still doing a "stop the world" for this? (Or at least I see that it stops my application at exactly this time) And why does it do such a negative cleanup if it is not really necessary as there is over 12% of the available RAM free. Also I thought that the default value for -XX:MaxGCPauseMillis
is 200 milliseconds, why is this value violated by a factor of 29 or even 50 (see below)?
Another reason for a delay was:
[GC pause (Metadata GC Threshold) (young) (initial-mark) 40G->39G(48G), 10.4667233 secs]
This will be probably solved via this answer e.g. just increasing the metadata space -XX:MetaspaceSize=100M
BTW: Using JSE 1.8.0_91-b14
Update: detailed GC log of such an event
2016-08-12T09:20:31.589+0200: 1178.312: [GC pause (G1 Evacuation Pause) (young) 1178.312: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 3159, predicted base time: 1.52 ms, remaining time: 198.48 ms, target pause time: 200.00 ms]
1178.312: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 136 regions, survivors: 20 regions, predicted young region time: 1924.75 ms]
1178.312: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 136 regions, survivors: 20 regions, old: 0 regions, predicted pause time: 1926.27 ms, target pause time: 200.00 ms]
1185.330: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 21.83 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
1185.330: [G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 42580574208 bytes, allocation request: 0 bytes, threshold: 23592960000 bytes (45.00 %), source: end of GC]
1185.330: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 1 regions, reclaimable: 3381416 bytes (0.01 %), threshold: 5.00 %]
, 7.0181903 secs]
[Parallel Time: 6991.8 ms, GC Workers: 10]
[GC Worker Start (ms): Min: 1178312.6, Avg: 1178312.8, Max: 1178312.9, Diff: 0.2]
[Ext Root Scanning (ms): Min: 1.1, Avg: 1.5, Max: 2.3, Diff: 1.2, Sum: 15.0]
[Update RS (ms): Min: 0.0, Avg: 0.3, Max: 1.3, Diff: 1.3, Sum: 3.4]
[Processed Buffers: Min: 0, Avg: 2.1, Max: 5, Diff: 5, Sum: 21]
[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.4]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum: 1.7]
[Object Copy (ms): Min: 6964.1, Avg: 6973.0, Max: 6989.5, Diff: 25.3, Sum: 69730.4]
[Termination (ms): Min: 0.0, Avg: 16.4, Max: 25.3, Diff: 25.3, Sum: 164.4]
[Termination Attempts: Min: 1, Avg: 3.2, Max: 13, Diff: 12, Sum: 32]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
[GC Worker Total (ms): Min: 6991.5, Avg: 6991.6, Max: 6991.7, Diff: 0.2, Sum: 69915.5]
[GC Worker End (ms): Min: 1185304.3, Avg: 1185304.3, Max: 1185304.3, Diff: 0.0]
[Code Root Fixup: 0.1 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.3 ms]
[Other: 26.0 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 25.3 ms]
[Ref Enq: 0.1 ms]
[Redirty Cards: 0.1 ms]
[Humongous Register: 0.2 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.2 ms]
[Eden: 2176.0M(2176.0M)->0.0B(2176.0M) Survivors: 320.0M->320.0M Heap: 40.6G(48.8G)->40.0G(48.8G)]
[Times: user=0.55 sys=46.58, real=7.02 secs]
Read here about it: Copying (Stop the World Event) - These are the stop the world pauses to evacuate or copy live objects to new unused regions. This can be done with young generation regions which are logged as [GC pause (young)]. Or both young and old generation regions which are logged as [GC Pause (mixed)].
One way is to increase the Java heap size. Look at the Garbage Collection subtab to estimate the heap size used by the application and change Xms and Xmx to a higher value. The bigger the Java heap, the longer time it is between GCs.
In G1, a Full GC is a single threaded Stop The World (STW) pause which will evacuate and compact all regions. There are three important pieces of information you can get from the Full GC logs. Moving on to the advanced flags, these options provide valuable insights into the values and thresholds.
Garbage collection (GC) is the process by which Java removes data that is no longer needed from memory. A garbage collection pause, also known as a stop-the-world event, happens when a region of memory is full and the JVM requires space to continue. During a pause all operations are suspended.
GC pause (G1 Evacuation Pause) — Evacuation Pause is a phase where live objects are copied from one region (young or young + old) to another region. (young) – indicates that this is a Young GC event.
Why is GCG1 still doing a "stop the world" for this?
Because G1 is not a pauseless collector, it is just a low-pause collector.
Also I thought that the default value for -XX:MaxGCPauseMillis is 200 milliseconds, why is this value violated by a factor of 29 or even 50 (see below)?
It is, but it's just a goal, not a guarantee. Many things can cause it to fail to meet that goal. You got a fairly large heap, this makes things more difficult, i.e. failures are easier to provoke.
Anyway, the GC tuning journey starts with enabling verbose GC logging via
-Xloggc:<path to gc log file>
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
Update: These options are for hotspot 8. 9 and later use unified logging which has a different parameter format.
and then running the resulting log through GCViewer to get a general overview and then going back to reading individual log entries (there are many answers/blog posts on this topic) to figure out what might be causing the worst behavior. Depending on the cause various remedies can be tried.
Some general understanding of how tracing garbage collectors work in general and G1 will be necessary to avoid cargo-culting.
My application has many allocations which could be easily called "humongous allocations".
If that actually is the cause then current VMs have some experimental options to reclaim them sooner.
[Object Copy (ms): Min: 6964.1, Avg: 6973.0, Max: 6989.5, Diff: 25.3, Sum: 69730.4] [Times: user=0.55 sys=46.58, real=7.02 secs]
This means it spends most of the time in the kernel when doing something that should mostly consist of memory accesses and not system calls. So swap activity or transparent huge pages are likely suspects.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With