Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interpret differences in prstat vs. 'prstat -m' on Solaris

Tags:

solaris

I've been using prstat and prstat -m a lot to investigate performance issues lately, and I think I've basically understood the differences of sampling vs. microstate accounting in Solaris 10. So I don't expect both to always show the exactly same number.

Today I came across an occasion where the 2 showed so vastly different outputs, that I have problems interpreting them and making sense of the output. The machine is a heavily loaded 8-CPU Solaris 10, with several large WebSphere processes and an Oracle database. The system practically came to a halt today for about 15 minutes (load averages of >700). I had difficulties to get any prstat information, but was able to get some outputs from "prtstat 1 1" and "prtstat -m 1 1", issued shortly one after another.

The top lines of the outputs:

prstat 1 1:

    PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  8379 was      3208M 2773M cpu5    60    0   5:29:13  19% java/145
  7123 was      3159M 2756M run     59    0   5:26:45 7.7% java/109
  5855 app1     1132M   26M cpu2    60    0   0:01:01 7.7% java/18
 16503 monitor   494M  286M run     59   19   1:01:08 7.1% java/106
  7112 oracle     15G   15G run     59    0   0:00:10 4.5% oracle/1
  7124 oracle     15G   15G cpu3    60    0   0:00:10 4.5% oracle/1
  7087 app1       15G   15G run     58    0   0:00:09 4.0% oracle/1
  7155 oracle     96M 6336K cpu1    60    0   0:00:07 3.6% oracle/1
...
Total: 495 processes, 4581 lwps, load averages: 74.79, 35.35, 23.8

prstat -m 1 1 (some seconds later)

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP  
  7087 app1     0.1  56 0.0 0.2 0.4 0.0  13  30  96   2  33   0 oracle/1
  7153 oracle   0.1  53 0.0 3.2 1.1 0.0 1.0  42  82   0  14   0 oracle/1
  7124 oracle   0.1  47 0.0 0.2 0.2 0.0 0.0  52  77   2  16   0 oracle/1
  7112 oracle   0.1  47 0.0 0.4 0.1 0.0 0.0  52  79   1  16   0 oracle/1
  7259 oracle   0.1  45 9.4 0.0 0.3 0.0 0.1  45  71   2  32   0 oracle/1
  7155 oracle   0.0  42  11 0.0 0.5 0.0 0.1  46  90   1   9   0 oracle/1
  7261 oracle   0.0  37 9.5 0.0 0.3 0.0 0.0  53  61   1  17   0 oracle/1
  7284 oracle   0.0  32 5.9 0.0 0.2 0.0 0.1  62  53   1  21   0 oracle/1
...
Total: 497 processes, 4576 lwps, load averages: 88.86, 39.93, 25.51

I have a very hard time interpreting the output. prstat seems to tell me that a fair amount of Java processing is going on, together with some Oracle stuff, just as I would expect in normal situation. prtstat -m shows a machine totally dominated by Oracle, consuming huge amounts of system time, and the overall CPU being heavily overloaded (large numbers in LAT).

I'm inclined to believe the output of prstat -m, because that matches much mor closely to what the system felt like during this time. Totally sluggish, almost no more user request processing going on from WebSphere, etc. But why would prstat show so heavily differing numbers?

Any explanation of this would be welcome!

CU, Joe

like image 488
jammann Avatar asked Nov 30 '10 22:11

jammann


1 Answers

There's a known problem with prstat -m on Solaris in the way cpu usage figures are calculated - the value you see has been averaged over all threads (LWPs) in a process, and hence is far far too low for heavily multithreaded processes - such as your Java app servers, which can have hundreds of threads each (see your NLWP). Less than a dozen of them are probably CPU hogs hence the CPU usage by java looks "low". You'd need to call it with prstat -Lm to get the per-LWP (thread) breakdown to see that effect. Reference:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6780169

Without further performance monitoring data it's hard to give non-speculative explanations of what you've seen there. I assume lock contention within java. One particular workload that can cause this is heavily multithreaded memory mapped I/O, they'll all pile up on the process address space lock. But it could be a purely java userside lock of course. A plockstat on one of the java processes, and/or simple dtrace profiling would be helpful.

like image 119
FrankH. Avatar answered Oct 11 '22 23:10

FrankH.