jHiccup analysis doesn't add up

I have the following jHiccup result.

[Image: jHiccup analysis graph]

Obviously there are huge peaks of a few seconds in the graph. My app outputs logs every 100 ms or so, and when I read my logs I never see such huge pauses. I can also check the total time spent in GC from the JVM diagnostics, which report the following:

Time:      2013-03-12 01:09:04
Used:      1,465,483 kbytes
Committed: 2,080,128 kbytes
Max:       2,080,128 kbytes
GC time:   2 minutes on ParNew (4,329 collections)
           8.212 seconds on ConcurrentMarkSweep (72 collections)

The total major-GC (ConcurrentMarkSweep) time is around 8 seconds, spread over 72 separate collections, and all of them are below 200 ms, in line with my JVM hint to limit the pauses.

On the other hand, I observed exactly one instance of a 5-second network response time in my independent network logs (Wireshark). That implies the pauses exist, but they are not GC, and they are not blocked threads or anything else that can be observed in a profiler or in thread dumps.

My question is: what would be the best way to debug or tune this behavior?

Additionally, I'd like to understand how jHiccup does its measurement; obviously it is not simply GC pause time.

Asked Mar 11 '13 by Vladimir Ralev

1 Answer

Glad to see you are using jHiccup, and that it seems to show reality-based hiccups.

jHiccup observes "hiccups" that would also be seen by application threads running on the JVM. It does not glean the reason - it just reports the fact. Reasons can be anything that would cause a process to not run perfectly ready-to-run code: GC pauses are a common cause, but a temporary ^Z at the keyboard, or one of those "live migration" things across virtualized hosts, would be observed just as well. There are a multitude of possible reasons, including scheduling pressure at the OS or hypervisor level (if one exists), power management craziness, swapping, and many others. I've seen Linux file system pressure and Transparent Huge Page "background" defragmentation cause multi-second hiccups as well.

A good first step at isolating the cause of the pause is to use the "-c" option in jHiccup: it launches a separate control process (with an otherwise idle workload). If both your application and the control process show hiccups that are roughly correlated in size and time, you'll know you are looking for a system-level (as opposed to process-local) reason. If they do not correlate, you'll know to suspect the insides of your JVM - which most likely indicates your JVM paused for something big: either GC or something else, like lock debiasing or a class-loading-driven deoptimization, which can take a really long [and often unreported in logs] time on some JVMs if time-to-safepoint is long for some reason (and on most JVMs, there are many possible causes for a long time-to-safepoint).
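For concreteness, here is one way such an invocation might look, using jHiccup's javaagent form (a hedged sketch: the application jar name is made up, and exact option syntax can vary between jHiccup versions, so check the README of the version you use):

    # Wrap the application with jHiccup and also launch the idle "control
    # process" via the -c option (other options here are illustrative only):
    java -javaagent:jHiccup.jar="-i 1000 -c" -jar yourapp.jar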

jHiccup's measurement is so dirt-simple that it's hard to get wrong. The entire thing is less than 650 lines of Java code, so you can look at the logic for yourself. jHiccup's HiccupRecorder thread repeatedly goes to sleep for 1 msec, and when it wakes up it records any difference in time (from before the sleep) that is greater than 1 msec as a hiccup. The simple assumption is that if one ready-to-run thread (the HiccupRecorder) did not get to run for 5 seconds, other threads in the same process also saw a similarly sized hiccup.
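To make that logic concrete, here is a minimal sketch of the idea in plain Java (an illustration only, not jHiccup's actual code, which also records the observed hiccups into a histogram log):

    // Minimal sketch of the hiccup-measurement idea described above.
    public class HiccupSketch {
        public static void main(String[] args) throws InterruptedException {
            final long resolutionNsec = 1_000_000L; // we ask to sleep ~1 msec per iteration
            while (true) {
                long beforeNsec = System.nanoTime();
                Thread.sleep(1);                    // a ready-to-run thread requesting a 1 msec nap
                long elapsedNsec = System.nanoTime() - beforeNsec;
                long hiccupNsec = elapsedNsec - resolutionNsec;
                if (hiccupNsec > resolutionNsec) {  // woke up more than 1 msec late: a "hiccup"
                    System.out.printf("Hiccup of ~%d msec observed%n", hiccupNsec / 1_000_000L);
                }
            }
        }
    }

If the process (or the whole machine) stalls for 5 seconds, this loop cannot run either, so the next wakeup reports a ~5-second hiccup - exactly the kind of spike the jHiccup graph shows.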

As you note above, jHiccup's observations seem to be corroborated by your independent network logs, where you saw a 5-second response time. Note that not all hiccups would have been observed by the network logs, as only requests actually made during a hiccup would have been seen by a network logger. In contrast, no hiccup larger than ~1 msec can hide from jHiccup, since it will attempt a wakeup 1,000 times per second even with no other activity.

This may not be GC, but before you rule out GC, I'd suggest you look into the GC logging a bit more. To start with, a JVM hint to limit pauses to 200 msec is useless on all known JVMs; a pause hint is the equivalent of saying "please". In addition, don't believe your GC logs unless you include -XX:+PrintGCApplicationStoppedTime in the options (and suspect them even then). There are pauses, and parts of pauses, that can be very long and go unreported unless you include this flag. E.g., I've seen pauses caused by the occasional long-running counted loop taking 15 seconds to reach a safepoint, where GC reported only the 0.08-second part of the pause in which it actually did some work. There are also plenty of pauses whose causes are not considered part of "GC" and can thereby go unreported by GC logging flags.
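For reference, here is a hedged example of the kind of HotSpot options being discussed (circa JDK 6/7 logging flags; the pause-target flag is an assumption about what the question's "hint" was, and it is only a request, never a guarantee):

    # GC logging that also reports total application-stopped (safepoint) time:
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
    -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
    # The kind of pause "hint" presumably referred to in the question:
    -XX:MaxGCPauseMillis=200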

-- Gil. [jHiccup's author]
