A production environment became very slow recently. The CPU usage of the process went up to 200%, although the service kept working. After I restarted the service it functioned normally again. I have several symptoms: the Par Survivor Space was empty for a long time, and garbage collection took about 20% of the CPU time.
JVM options:
-XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:HeapDumpPath=heapdump.hprof -XX:MaxNewSize=700m -XX:MaxPermSize=786m -XX:NewSize=700m -XX:ParallelGCThreads=8 -XX:SurvivorRatio=25 -Xms2048m -Xmx2048m

Environment (from New Relic):
Arch: amd64
Dispatcher: Apache Tomcat
Dispatcher Version: 7.0.27
Framework: java
Heap initial (MB): 2048.0
Heap max (MB): 2022.125
Java version: 1.6.0_35
Log path: /opt/newrelic/logs/newrelic_agent.log
OS: Linux
Processors: 8
System Memory: 8177.964, 8178.0
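For reference, on Tomcat these options are typically set through CATALINA_OPTS, e.g. in bin/setenv.sh (a sketch only; the exact script name and location depend on the installation):

    # bin/setenv.sh (sketch) -- same options as above, applied to Tomcat
    CATALINA_OPTS="-Xms2048m -Xmx2048m -XX:NewSize=700m -XX:MaxNewSize=700m \
      -XX:SurvivorRatio=25 -XX:MaxPermSize=786m \
      -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled \
      -XX:ParallelGCThreads=8 \
      -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=heapdump.hprof"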
More info is in the attached picture: when the problem occurred, the used Code Cache and used CMS Perm Gen on the non-heap dropped to about half. I took this information from New Relic.
The question is: why does the server become so slow?
Sometimes the server stops completely, but we traced that to a separate problem with PDFBox: uploading certain PDFs that contain certain fonts crashes the JVM.
More info: I observed that the Old Gen fills up every day, so for now I restart the server daily. After a restart everything is fine and dandy, but the Old Gen fills up again by the next day and the server slows down until it needs another restart.
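A steadily growing Old Gen like this usually means either a memory leak or simply not enough headroom. One way to check, assuming you can reach the box and know the Tomcat PID (a placeholder below), is to take a heap dump shortly after a restart and another one late in the day, then compare the two in Eclipse MAT or jhat:

    # live-objects heap dump of the Tomcat process (<tomcat-pid> is a placeholder)
    jmap -dump:live,format=b,file=/tmp/old-gen-check.hprof <tomcat-pid>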
When the Eden space becomes full, a minor GC takes place. During a minor GC, objects that survive are moved from Eden to a Survivor space.
Eden Space: the pool from which memory is initially allocated for most objects.
Survivor Space: the pool containing objects that have survived the garbage collection of the Eden space.
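You can watch exactly this behaviour live with jstat, which ships with the JDK (the PID is a placeholder); the E, S0/S1 and O columns show Eden, Survivor and Old Gen occupancy in percent:

    # print GC utilisation of each pool every 1000 ms
    jstat -gcutil <tomcat-pid> 1000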
The SurvivorRatio parameter can be used to tune the size of the survivor spaces, but this is often not important for performance. For example, -XX:SurvivorRatio=6 sets the ratio between Eden and a single survivor space to 6:1.
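Applied to the settings above (a rough calculation only; the JVM may round the actual sizes): with -XX:MaxNewSize=700m and -XX:SurvivorRatio=25 the young generation is divided into 25 + 1 + 1 parts, so each survivor space is only about 700 MB / 27 ≈ 26 MB and Eden is about 648 MB.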
G1 is a generational, incremental, parallel, mostly concurrent, stop-the-world, and evacuating garbage collector which monitors pause-time goals in each of the stop-the-world pauses. Similar to other collectors, G1 splits the heap into (virtual) young and old generations.
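If you want to try G1 instead of CMS, the switch is essentially one flag plus an optional pause-time goal. Note that on Java 1.6.0_35 G1 is still experimental and has to be unlocked; it is only fully supported from Java 7u4 onwards:

    # experimental on Java 6, fully supported from 7u4
    -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:MaxGCPauseMillis=200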
By default, CMS starts a concurrent collection once OldGen is about 70% full. If it can't free memory back below this boundary, it runs concurrently all the time, which slows down the application significantly. If OldGen gets close to completely full, CMS panics and falls back to a stop-the-world full GC, which can be very long (something like 20 seconds). You probably need more headroom in OldGen (and make sure your app does not leak memory, of course!). Additionally, you can lower the threshold at which the concurrent collection starts (default 70%) using
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=50
This triggers the concurrent collection at 50% occupancy and increases the chance that CMS finishes the GC in time. It only helps if your allocation rate is too high for the current settings; from your charts it looks like a combination of not enough headroom / a memory leak and a too-high -XX:CMSInitiatingOccupancyFraction. Give OldGen at least 500 MB to 1 GB more space.
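Put together, one possible adjustment is sketched below. The exact heap size is an assumption (it depends on how much of the 8 GB box you can give the JVM), and it only buys time if the leak itself is not fixed:

    # sketch: 1 GB more heap (all of it goes to OldGen since the young gen stays at 700m),
    # plus an earlier, deterministic CMS trigger
    -Xms3072m -Xmx3072m -XX:NewSize=700m -XX:MaxNewSize=700m \
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled \
    -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=50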