 

Bad performance on Azure for Owin/IIS application

We ran some performance tests and I noticed that the CPU spends a lot of time in kernel mode. I'd like to know why that is.

The application: it's a classic Azure Cloud Service web role where OWIN is listening under IIS and OWIN itself serves just static files that are cached in memory (so there should be only a small performance penalty and everything should be pretty fast). The content is copied to the output stream via await stream.CopyToAsync(response.Body).
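
For reference, the serving path looks roughly like the minimal OWIN middleware sketch below, which answers requests from an in-memory cache; the class name, the cache dictionary and its registration are assumptions for illustration, not the actual code:

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Owin;

// Hypothetical middleware: serves pre-cached static files from memory.
public class CachedStaticFileMiddleware : OwinMiddleware
{
    // path -> (content type, file bytes), filled once at startup
    private readonly IDictionary<string, Tuple<string, byte[]>> _cache;

    public CachedStaticFileMiddleware(OwinMiddleware next,
        IDictionary<string, Tuple<string, byte[]>> cache) : base(next)
    {
        _cache = cache;
    }

    public override async Task Invoke(IOwinContext context)
    {
        Tuple<string, byte[]> entry;
        if (_cache.TryGetValue(context.Request.Path.Value, out entry))
        {
            context.Response.ContentType = entry.Item1;
            context.Response.ContentLength = entry.Item2.Length;

            // Same pattern as in the question: copy cached bytes to the response body.
            using (var stream = new MemoryStream(entry.Item2, writable: false))
            {
                await stream.CopyToAsync(context.Response.Body);
            }
            return;
        }

        await Next.Invoke(context);
    }
}

// Registered in Startup.Configuration, e.g.: app.Use<CachedStaticFileMiddleware>(cache);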

The test itself looks like this in Gatling:

val openLoginSet = exec(http("ROOT")
      .get("/")
      .headers(Headers105Test2.headers_0)
      .resources(
        http("MED: arrow-down-small.png").get(uriIconAssets + "/arrow-down-small.png").headers(Headers105Test2.headers_1),
        http("MED: arrow-up-small.png").get(uriIconAssets + "/arrow-up-small.png").headers(Headers105Test2.headers_1),
        http("MED: close-medium.png").get(uriIconAssets + "/close-medium.png").headers(Headers105Test2.headers_1),
        http("MED: decline-medium.png").get(uriIconAssets + "/decline-medium.png").headers(Headers105Test2.headers_1),
        http("MED: help-medium.png").get(uriIconAssets + "/help-medium.png").headers(Headers105Test2.headers_1),
        http("MED: submit-medium.png").get(uriIconAssets + "/submit-medium.png").headers(Headers105Test2.headers_1),
        http("MED: delete-medium.png").get(uriIconAssets + "/delete-medium.png").headers(Headers105Test2.headers_1),
        http("MED: en-us.js").get("/en-us.js").headers(Headers105Test2.headers_8),
        http("MED: cloud_logo_big.png").get("/assets/cloud_logo_big.png").headers(Headers105Test2.headers_1),
        http("MED: favicon.ico").get("/favicon.ico").headers(Headers105Test2.headers_0))

val httpProtocol = http
  .baseURL("https://myurl.com")
  .inferHtmlResources()

val openLoginSenario = scenario("OpenOnly").exec(repeat(400, "n") {
    exec(openLoginSet).pause(3,6)
})

setUp(openLoginSenario.inject(rampUsers(150) over (3 minutes)))
  .protocols(httpProtocol)
  .maxDuration(3 minutes)

(I shortened the test to run for 3 minutes just to capture data to show here.) There are 3 machines running this Gatling test, each with up to 150 concurrent threads, so 450 threads in total.

What I see is that a lot of code runs in kernel mode and the w3wp process doesn't take most of the CPU:

Captured CPU when the test had just started (the CPU rises as new threads are added):

The test just started

Captured CPU when the test was nearly at its end:

Before end of the test

The kernel-mode time looks pretty bad and I'm not sure what might cause it. There should be almost no locks involved. While reading up on what else might cause high kernel-mode time, I found that DPCs could be responsible. So I captured some DPC data as well, but I'm not sure what's normal and what isn't. Anyway, the graph with DPC max times is also included in the screenshot.

vmbus.sys accounts for the most significant time among all DPCs. That means the Azure instance is not bare metal (not surprising) and that it shares its computational power with others. As I understand it, vmbus.sys is responsible for communication between, e.g., the network card and the hosted Hyper-V instance. Might running on Hyper-V be the main cause of the low performance?

I'd like to know where to look and how to find out what causes the kernel-mode time in my situation.


Some more data:

Part of DPC data when the test started (taken in 30 sec):

Total = 17887 for module vmbus.sys
Elapsed Time, >        0 usecs AND <=        1 usecs,    137, or   0.77%
Elapsed Time, >        1 usecs AND <=        2 usecs,   2148, or  12.01%
Elapsed Time, >        2 usecs AND <=        4 usecs,   3941, or  22.03%
Elapsed Time, >        4 usecs AND <=        8 usecs,   2291, or  12.81%
Elapsed Time, >        8 usecs AND <=       16 usecs,   5182, or  28.97%
Elapsed Time, >       16 usecs AND <=       32 usecs,   3305, or  18.48%
Elapsed Time, >       32 usecs AND <=       64 usecs,    786, or   4.39%
Elapsed Time, >       64 usecs AND <=      128 usecs,     85, or   0.48%
Elapsed Time, >      128 usecs AND <=      256 usecs,      6, or   0.03%
Elapsed Time, >      256 usecs AND <=      512 usecs,      1, or   0.01%
Elapsed Time, >      512 usecs AND <=     1024 usecs,      2, or   0.01%
Elapsed Time, >     1024 usecs AND <=     2048 usecs,      0, or   0.00%
Elapsed Time, >     2048 usecs AND <=     4096 usecs,      1, or   0.01%
Elapsed Time, >     4096 usecs AND <=     8192 usecs,      2, or   0.01%
Total,                                                 17887

Part of DPC data when the test was ending (taken in 30 sec):

Total = 141796 for module vmbus.sys
Elapsed Time, >        0 usecs AND <=        1 usecs,   7703, or   5.43%
Elapsed Time, >        1 usecs AND <=        2 usecs,  21075, or  14.86%
Elapsed Time, >        2 usecs AND <=        4 usecs,  17301, or  12.20%
Elapsed Time, >        4 usecs AND <=        8 usecs,  38988, or  27.50%
Elapsed Time, >        8 usecs AND <=       16 usecs,  32028, or  22.59%
Elapsed Time, >       16 usecs AND <=       32 usecs,  11861, or   8.36%
Elapsed Time, >       32 usecs AND <=       64 usecs,   7034, or   4.96%
Elapsed Time, >       64 usecs AND <=      128 usecs,   5038, or   3.55%
Elapsed Time, >      128 usecs AND <=      256 usecs,    606, or   0.43%
Elapsed Time, >      256 usecs AND <=      512 usecs,     53, or   0.04%
Elapsed Time, >      512 usecs AND <=     1024 usecs,     26, or   0.02%
Elapsed Time, >     1024 usecs AND <=     2048 usecs,     11, or   0.01%
Elapsed Time, >     2048 usecs AND <=     4096 usecs,     10, or   0.01%
Elapsed Time, >     4096 usecs AND <=     8192 usecs,     53, or   0.04%
Elapsed Time, >     8192 usecs AND <=    16384 usecs,      3, or   0.00%
Elapsed Time, >    16384 usecs AND <=    32768 usecs,      1, or   0.00%
Elapsed Time, >    32768 usecs AND <=    65536 usecs,      5, or   0.00%
Total,                                                141796

% DPC Time from start to end of the test


We also suspected that we were hitting network limits, i.e. the tests 'download' so much data that the network adapter's limits are reached. This might be true towards the end of the test (when the number of threads is at its maximum), but it doesn't explain why there is so much kernel-mode time even at the beginning of the test.

Just to show how much data is sent: the volume of sent data (cyan line) is two orders of magnitude lower than the capacity of the network adapter.


asked May 05 '16 14:05 by stej


1 Answer

This might not help you directly, but we had some performance issues after moving an application to the cloud. Please find the discussion here:

Huge performance drop after moving to Azure

After a lot of investigation we finally found out that our issue was with the Transient Fault Handling mechanism. It read web.config on every call, causing huge CPU usage, which was not an issue with the same code in a non-cloud environment. We handled it by wrapping it in a Singleton pattern, as sketched below.
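
The answer doesn't include code, so the sketch below is only an illustration of the pattern: hold the retry policy in a lazily initialized singleton so the configuration is read once and the same instance is reused on every call (the Enterprise Library types and the DoDatabaseWork call are assumptions, not the original code):

using System;
using Microsoft.Practices.EnterpriseLibrary.TransientFaultHandling;

public static class RetryPolicyProvider
{
    // Lazy<T> gives thread-safe, one-time initialization (a simple singleton),
    // instead of re-creating the policy (and re-reading web.config) on every call.
    private static readonly Lazy<RetryPolicy> policy =
        new Lazy<RetryPolicy>(() =>
            new RetryPolicy<SqlDatabaseTransientErrorDetectionStrategy>(
                new FixedInterval(3, TimeSpan.FromSeconds(1))));

    public static RetryPolicy Instance
    {
        get { return policy.Value; }
    }
}

// Usage: RetryPolicyProvider.Instance.ExecuteAction(() => DoDatabaseWork());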

Hope this helps you find out whether there are any such issues in your application.

Cheers. :)

answered Nov 12 '22 13:11 by SamGhatak