
Getting an advance warning before full GC

In the context of a soft real-time system that should not pause for more than 200ms, we're looking for a way to get advance warning before a full GC becomes imminent. We realize we might not be able to avoid one, but we'd like to fail over to another node before the system stalls.

We have come up with a scheme that provides advance warning of an imminent full GC that may stall the system for several seconds (which we need to avoid).

The scheme relies on CMS free list statistics, enabled with -XX:PrintFLSStatistics=1. This prints free list statistics to the GC log after every GC cycle, including young GCs, so the information is available at short intervals and appears even more frequently during periods of high allocation rate. It probably costs a little in performance, but our working assumption is that we can afford it.
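For reference, a minimal command line that produces this output might look like the following (a sketch; the application jar name is a placeholder, and the surrounding flags reflect a typical CMS setup on JDK 7/8 rather than our exact configuration):

java -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -XX:PrintFLSStatistics=1 \
     -Xloggc:gc.log \
     -jar app.jar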

The output to the log looks like so:

Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 382153298
Max   Chunk Size: 382064598
Number of Blocks: 28
Av.  Block  Size: 13648332
Tree      Height: 8

In particular, the maximum free chunk size is 382064598 words. With 8-byte heap words on a 64-bit JVM, that is 382,064,598 × 8 = 3,056,516,784 bytes, just under 2915MB. This number has been decreasing very slowly, at a rate of roughly 1MB per hour.

It is our understanding that so long as the maximum free chunk size is larger than the young generation (assuming no humongous object allocation), every object promotion should succeed.
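To make the scheme concrete, here is a minimal sketch (not our production code) of a watchdog that runs as a thread inside the application JVM, tails its own GC log, extracts the latest maximum free chunk size, converts it from heap words to bytes (8 bytes per word on a 64-bit JVM), and compares it against the young generation capacity taken from JMX. The log path, polling interval, and pool-name matching are assumptions that would need validating against the actual JVM in use:

import java.io.RandomAccessFile;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Watches "Max Chunk Size" in the CMS GC log and warns when promotion headroom shrinks. */
public class MaxChunkWatchdog implements Runnable {
    // Matches lines like "Max   Chunk Size: 382064598" printed by -XX:PrintFLSStatistics=1
    private static final Pattern MAX_CHUNK = Pattern.compile("Max\\s+Chunk\\s+Size:\\s+(\\d+)");
    private static final long BYTES_PER_WORD = 8; // heap word on a 64-bit JVM

    @Override public void run() {
        long youngGenBytes = youngGenCapacity();
        try (RandomAccessFile log = new RandomAccessFile("gc.log", "r")) { // path is an assumption
            long pos = 0;
            while (!Thread.currentThread().isInterrupted()) {
                log.seek(pos);
                String line;
                while ((line = log.readLine()) != null) {
                    Matcher m = MAX_CHUNK.matcher(line);
                    if (m.find()) {
                        long chunkBytes = Long.parseLong(m.group(1)) * BYTES_PER_WORD;
                        if (chunkBytes < youngGenBytes) {
                            // A full young gen may no longer fit in one chunk on promotion:
                            // this is the point at which we would start failing over.
                            System.err.println("WARN: max free chunk " + chunkBytes
                                    + "B below young gen capacity " + youngGenBytes + "B");
                        }
                    }
                }
                pos = log.getFilePointer();
                Thread.sleep(1000); // poll once per second; log rotation is not handled here
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /** Young gen capacity = eden + survivor pools (names vary by collector and JVM version). */
    private static long youngGenCapacity() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String n = pool.getName();
            if (n.contains("Eden") || n.contains("Survivor")) {
                total += pool.getUsage().getMax(); // getMax() can be -1 if undefined; ignored here
            }
        }
        return total;
    }
}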

Recently we ran a stress test lasting several days, and saw that CMS maintained maximum chunk sizes upward of 94% of the total old generation space. The maximum free chunk size appears to be decreasing at a rate of less than 1MB/hour, which should be fine -- by this measure we won't be hitting a full GC any time soon, and the servers will likely be down for maintenance more frequently than a full GC can occur.

In a previous test, at a time when the system was less memory efficient, we were able to run the system for a good 10 hours. During the first hour, the maximum free chunk size decreased to 100MB, where it stayed for over 8 hours. During the last 40 minutes of the run, it decreased at a steady rate towards 0, at which point a full GC occurred -- this was very encouraging, because for that workload we got a 40-minute advance warning (from the moment the chunk size began its steady decline towards 0).
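To automate that pattern, one could fit a simple linear trend over a sliding window of (timestamp, max chunk size) samples and estimate the time remaining until the fitted line reaches zero. A rough sketch of the idea (the window size is an arbitrary assumption, and real workloads may call for something more robust than a least-squares fit):

import java.util.ArrayDeque;
import java.util.Deque;

/** Estimates time until the max free chunk hits zero, via a least-squares slope. */
public class ChunkTrend {
    private static final int WINDOW = 120; // number of recent samples to keep (assumption)
    private final Deque<long[]> samples = new ArrayDeque<>(); // {timestampMillis, chunkBytes}

    public void addSample(long timestampMillis, long chunkBytes) {
        samples.addLast(new long[] { timestampMillis, chunkBytes });
        if (samples.size() > WINDOW) samples.removeFirst();
    }

    /** Millis until the fitted line crosses zero; Long.MAX_VALUE if flat or growing. */
    public long millisToZero() {
        int n = samples.size();
        if (n < 2) return Long.MAX_VALUE;
        long t0 = samples.peekFirst()[0]; // shift timestamps for numerical stability
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (long[] s : samples) {
            double x = s[0] - t0, y = s[1];
            sumX += x; sumY += y; sumXY += x * y; sumXX += x * x;
        }
        double denom = n * sumXX - sumX * sumX;
        if (denom == 0) return Long.MAX_VALUE;
        double slope = (n * sumXY - sumX * sumY) / denom;
        if (slope >= 0) return Long.MAX_VALUE; // chunk size is not shrinking
        double intercept = (sumY - slope * sumX) / n;
        double xZero = -intercept / slope; // where the fitted line reaches zero
        return (long) Math.max(0, xZero - (samples.peekLast()[0] - t0));
    }
}

Feeding every parsed sample into addSample() and failing over once millisToZero() drops below some margin (say, 30 minutes) would reproduce the 40-minute warning we observed, without a human watching the log.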

My question to you: assuming this all reflects a prolonged peak workload (workload at any given point in production will only be lower), does this sound like a valid approach? How reliably do you reckon we can count on the maximum free chunk size statistic from the GC log?

We are definitely open to suggestions, but request that they be limited to solutions available on HotSpot (no Azul for us, at least for now). Also, G1 by itself is no solution unless we can come up with a similar metric that gives advance warning before full GCs, or before any GCs that significantly exceed our SLA (and these can occasionally occur).

asked Apr 29 '13 by nadavwr



1 Answer

I'm posting here relevant excerpts from a very enlightening and encouraging reply by Jon Masamitsu of Oracle, which I received on the HotSpot GC usage mailing list -- he works on HotSpot, so this is very good news indeed.

At any rate, the question remains open for now (I can't credit myself for quoting an email :-) ), so please add your suggestions!

Formatting: quotes from the original post are prefixed with '>>'; Jon's responses are prefixed with '>'.

>> It is our understanding that so long as the maximum free chunk size is larger than the young generation (assuming no humongous object allocation), every object promotion should succeed.

> To a very large degree this is correct. There are circumstances under which an object promoted from the young generation into the CMS generation will require more space in the CMS generation than it did in the young generation. I don't think this happens to a significant extent.

The above is very encouraging, since we can definitely dedicate some spare memory to protect against the rare cases he describes, and it sounds like we'd be doing fine otherwise.

<--snip-->

>> My question to you: assuming this all reflects a prolonged peak workload (workload at any given point in production will only be lower), does this sound like a valid approach? How reliably do you reckon we can count on the maximum free chunk size statistic from the GC log?

> The maximum free chunk size is exact at the time the GC prints it, but it can be stale by the time you read it and make your decisions.

For our workloads, this metric is on a very slow downward spiral, so a little staleness won't hurt us.

<--snip-->

>> We are definitely open to suggestions, but request that they be limited to solutions available on HotSpot (no Azul for us, at least for now). Also, G1 by itself is no solution unless we can come up with a similar metric that gives advance warning before full GCs, or before any GCs that significantly exceed our SLA (and these can occasionally occur).

> I think that the use of maximum free chunk size as your metric is a good choice. It is very conservative (which sounds like what you want) and not subject to odd mixtures of object sizes.

> For G1 I think you could use the number of completely free regions. I don't know if it is printed in any of the logs currently, but it is probably a metric we maintain (or could easily). If the number of completely free regions decreases over time, it could signal that a full GC is coming.

> Jon

Thank you Jon!
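Following up on Jon's G1 suggestion: I'm not aware of the free-region count being printed directly, but a rough upper bound can be derived from standard JMX data, since committed-but-unused heap space is held in free regions. A sketch (treat the result as an approximation: partially filled regions inflate it, so the downward trend matters more than the absolute number):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Rough upper bound on the number of completely free G1 regions, via standard JMX. */
public class G1FreeRegionEstimate {
    public static long estimate() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // Region size as chosen (or defaulted) by the JVM, exposed as a VM option.
        long regionSize = Long.parseLong(
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class)
                        .getVMOption("G1HeapRegionSize").getValue());
        // Each completely free region contributes a full regionSize of unused space,
        // but partially filled regions add slack too, so this over-counts slightly.
        return (heap.getCommitted() - heap.getUsed()) / regionSize;
    }

    public static void main(String[] args) {
        System.out.println("free G1 regions (upper bound): " + estimate());
    }
}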

answered Oct 11 '22 by nadavwr