I've got a strange problem in my Clojure app.
I'm using http-kit to write a websocket-based chat application.
Clients are rendered using React as a single-page app; the first thing they do when they navigate to the home page (after signing in) is open a websocket to receive things like real-time updates and any chat messages. You can see the site here: www.csgoteamfinder.com
The problem I have is that after some indeterminate amount of time (it might be 30 minutes after a restart, or even 48 hours), the JVM running the chat server suddenly starts consuming all the CPU. When I inspect it with NR (New Relic) I can see that all that time is being used by the garbage collector; at this stage I have no idea what it's doing.
I've taken a number of screenshots where you can see the effect.
You can see a number of spikes; those spikes correspond to large increases in CPU usage caused by the garbage collector. To free up CPU I usually have to restart the JVM. I have been relying on a CPU alert from NR in my Slack account to make sure I jump on these quickly... but I really need to get to the root of the problem.
My initial thought was that I was possibly holding onto the socket reference after the client closed it at their end, but this is not the case: I've been checking the socket count periodically and it is fairly stable.
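To illustrate what I mean by tracking the socket count, the pattern is along these lines (a simplified sketch with placeholder names, not my actual handler):

    (ns chat.sockets
      (:require [org.httpkit.server :as http]))

    ;; Placeholder registry of open channels -- simplified for illustration.
    (defonce channels (atom #{}))

    (defn ws-handler [request]
      (http/with-channel request ch
        (swap! channels conj ch)                       ; register on connect
        (http/on-close ch (fn [_status]
                            (swap! channels disj ch))) ; deregister on close
        (http/on-receive ch (fn [msg]
                              (http/send! ch msg)))))

    ;; (count @channels) stays fairly stable over time.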
Any ideas of where to start?
Kind regards, Jason.
It's hard to imagine what could have caused such an issue, but the first thing I would do is take a heap dump at the time of the crash. This can be enabled with the -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<path_to_your_heap_dump> JVM args. As a general practice, don't increase the heap size beyond the physical memory available on your server machine. In some rare cases the JVM is unable to dump the heap because the process is doomed; in such cases you can use gcore (if you're on Linux; not sure about Windows).
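For instance, a full invocation might look like this (the jar name and dump path are placeholders). And since your process spikes CPU without necessarily throwing OutOfMemoryError, you can also grab a dump on demand with jmap:

    java -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/var/tmp/chat.hprof \
         -jar chat-server.jar

    # on demand, while the process is still running:
    jmap -dump:live,format=b,file=/var/tmp/chat.hprof <pid>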
Once you grab the heap dump, analyse it with MAT (the Eclipse Memory Analyzer). I have debugged such applications and this worked perfectly to pin down memory-related issues. MAT lets you dissect the heap dump in depth, so you're sure to find the cause of your memory issue, unless you've simply allocated too small a heap.
Things to try in order:
You have triggered a global GC. GC time grows faster than linearly with the amount of memory, so actually reducing the heap space will make each global GC faster, at the cost of triggering it more often.
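For example, pinning the heap to a smaller fixed size (the size and jar name here are placeholders):

    java -Xms2g -Xmx2g -jar chat-server.jar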
You can also experiment with changing the GC algorithm. We had a system where the global GC went down from 200s (it happened 1-2 times per 24 hours) to 12s. Yes, the system was at a complete standstill for 3 minutes, and no, the users were not happy :-) You could try -XX:+UseConcMarkSweepGC (see the example after the link below).
http://www.fasterj.com/articles/oraclecollectors1.shtml
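A minimal sketch of switching collectors and turning on GC logging so you can see what the collector is actually doing (jar name and log path are placeholders; these flags apply to JDK 8):

    java -XX:+UseConcMarkSweepGC \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -Xloggc:/var/log/chat-gc.log \
         -jar chat-server.jar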
You will always have pauses like this with the JVM and similar runtimes; it is more a question of how often you get them and how fast the global GC is. You should take a heap dump and get the count of objects of each class. Most likely you will see that you have millions of instances of one of them: somehow you are keeping pointers to them unnecessarily, in an ever-growing cache, session map, or similar.
http://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/memleaks001.html#CIHCAEIH
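A quick first pass at those per-class counts, without firing up MAT, is a class histogram (<pid> is your JVM's process id):

    jmap -histo:live <pid> | head -n 25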
You can also move to a high-availability setup with at least 2 nodes, so that when one node is busy doing GC, the other node handles the full load for a while. Hopefully you will not get a global GC on both systems at the same time.
Big binary objects like byte[] arrays and similar are a real problem. Do you have those?
At some point these need to be compacted by the global GC, and that is a slow operation. Many JVM-based data-processing solutions actually avoid storing all their data as plain POJOs on the heap, and implement their own off-heap storage to get around this problem.
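To illustrate the off-heap idea in Clojure terms (purely illustrative, not taken from any particular library): a direct ByteBuffer lives outside the GC-managed heap, so the global GC never has to copy or compact its contents:

    (ns example.offheap
      (:import [java.nio ByteBuffer]))

    ;; Allocate n bytes outside the normal Java heap; the GC tracks only
    ;; the small ByteBuffer wrapper object, not the n-byte payload.
    (defn direct-buffer ^ByteBuffer [n]
      (ByteBuffer/allocateDirect (int n)))

    ;; e.g. (def buf (direct-buffer (* 64 1024 1024))) ; 64 MB off-heap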
Another solution is to switch from the JVM to Erlang. Erlang is near real-time, and it gets by without the concept of a global GC of the whole heap: Erlang has many small per-process heaps. You can read a little about it at
https://hamidreza-s.github.io/erlang%20garbage%20collection%20memory%20layout%20soft%20realtime/2015/08/24/erlang-garbage-collection-details-and-why-it-matters.html
Erlang is slower than the JVM, since it copies data between processes, but its performance is much more predictable. It is difficult to have both. I have a websocket-based Erlang solution, and it really works well.
So you've run into a problem that is expected and normal for the JVM, Microsoft's CLR, and similar runtimes. It will get worse and more common over the next couple of years as heap sizes grow.