We are running a real-time (RT) system in Java. It often uses relatively large heaps (100+ GB) and serves requests coming from a message queue. Each request must be handled fast (<100 ms) to meet the SLAs.
We are experiencing serious GC-related problems: the GC often triggers a stop-the-world collection (200+ ms) in the middle of a request, which makes the request fail.
One of our developers with reasonable knowledge of GCs spent quite some time tuning GC parameters and trying different GCs. After several days, he came up with a parametrization that we jokingly call "evolved by genetic algorithm". It lowers the GC pauses, but is still far from meeting the SLA requirements.
The solution I am looking for is to protect some critical parts of the code from GC and, after a request is finished, let the GC do as much work as it needs before taking the next request. Occasional pauses outside the requests would be OK, because we have several workers, and a worker that is garbage-collecting would simply not ask for requests for a while.
I have some ideas which are silly, ugly, and most probably won't work, but hopefully they illustrate the problem:
Invoke Thread.sleep() in the receiving thread, praying for the GC to do some work in the meantime,
Invoke System.gc() or Runtime.gc() between requests, again hopelessly praying for it to help.
The last important note is that we are a low-budget startup and commercial solutions such as Zing® are not an option for us; we are looking for a non-commercial solution.
Any ideas? We would rewrite our code entirely in C++ (at the beginning we didn't know that GC might be a problem rather than a solution), but the code base is already too large to do that.
Use a different JVM? Azul claims to be able to handle such cases. Red Hat and Oracle are contributing Shenandoah and ZGC to OpenJDK, respectively, with similar goals in mind, so maybe you could try experimental builds if you don't want a commercial solution.
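If you do try a build with one of these collectors, it helps to confirm which collector the JVM actually picked up. In OpenJDK builds that ship them, the collectors are selected with -XX:+UseShenandoahGC or -XX:+UseZGC (experimental builds may additionally need -XX:+UnlockExperimentalVMOptions). The following is only a minimal sketch using the standard java.lang.management API; the class name GcReport is made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the collectors the running JVM exposes.
final class GcReport {

    /** One line per collector bean: name, collection count, accumulated time. */
    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values may be -1 if the collector does not report them.
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", totalMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        describe().forEach(System.out::println);
    }
}
```

The reported time is whatever the JVM accumulates for that bean; depending on the collector it may include concurrent work rather than only pauses, so GC logs remain the better source for pause times.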
There are also other JVMs focused on realtime applications, but as I understand it they target harder realtime requirements on smaller systems; yours sound more like soft realtime requirements.
Another thing you can attempt is significantly reducing object allocations (profile your application!) by using pre-allocated objects or more compact data representations where applicable. Reducing allocation pressure while keeping the new gen size the same means increased mortality rate per collection, which should speed up young collections.
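To illustrate the pre-allocation idea, here is a minimal sketch of an object pool. The class ObjectPool and its methods are hypothetical names, not something from your code base, and this version is deliberately not thread-safe: it assumes one pool per worker thread.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Hypothetical single-threaded pool that hands out pre-allocated objects.
final class ObjectPool<T> {
    private final ArrayDeque<T> free;
    private final Supplier<T> factory;

    ObjectPool(int size, Supplier<T> factory) {
        this.factory = factory;
        this.free = new ArrayDeque<>(size);
        // Allocate everything up front, outside the request path.
        for (int i = 0; i < size; i++) {
            free.push(factory.get());
        }
    }

    /** Returns a pooled instance, falling back to a fresh allocation if the pool is empty. */
    T acquire() {
        T obj = free.poll();
        return obj != null ? obj : factory.get();
    }

    /** Hands an instance back; the caller must reset its state before reuse. */
    void release(T obj) {
        free.push(obj);
    }

    int available() {
        return free.size();
    }
}
```

A request handler would then do something like `byte[] buf = pool.acquire(); try { ... } finally { pool.release(buf); }` with `pool = new ObjectPool<>(64, () -> new byte[8192])`. Whether this helps depends on the allocation profile, so measure before and after.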
Choosing hardware to maximize memory bandwidth might help too.
"Invoke System.gc() or Runtime.gc() between requests, again hopelessly praying for it to help"
This might work when combined with -XX:+ExplicitGCInvokesConcurrent; otherwise it would trigger a single-threaded STW collection with CMS or G1 (I'm assuming you're using one of those). But that approach seems brittle and requires lots of tuning and monitoring.
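A rough sketch of what "GC between requests" could look like in a worker, assuming the JVM is started with -XX:+ExplicitGCInvokesConcurrent. The class GcBetweenRequests, the in-memory BlockingQueue standing in for your message queue, and the gcEvery parameter are all made up for illustration. System.gc() is only a hint, and it is a no-op if the JVM runs with -XX:+DisableExplicitGC.

```java
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical worker loop: handle requests, then hint a GC while idle.
final class GcBetweenRequests {

    /**
     * Handles every request currently in the queue and returns how many
     * were handled. After every gcEvery-th request the worker asks for a
     * collection before it takes the next request.
     */
    static <T> int drain(BlockingQueue<T> queue, Consumer<T> handler, int gcEvery) {
        int handled = 0;
        T request;
        // poll() keeps the sketch finite; a real worker would block on take().
        while ((request = queue.poll()) != null) {
            handler.accept(request);
            handled++;
            if (handled % gcEvery == 0) {
                // Not inside a request here, so a pause costs no SLA.
                System.gc();
            }
        }
        return handled;
    }
}
```

How often to request a collection (gcEvery here) is exactly the kind of knob that needs the tuning and monitoring mentioned above; GC logs will show whether the explicit cycles actually reduce pauses during requests.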