CPU usage not maximized and high synchronization in server app relying on async/await

I am currently performing some benchmarks of a server application I have developed, heavily relying on C#5 async/await constructs.

This is a console app, so there is no synchronization context and no threads are explicitly created in the code. The application dequeues requests from an MSMQ queue as fast as it can (asynchronous dequeuing loop) and processes each request before sending the processed request via an HttpClient.

The I/Os relying on async/await are: dequeuing from MSMQ, reading/writing data to a SQL Server DB, and finally sending the HttpClient request at the end of the chain.

Currently, for my benchmarks, the DB is completely faked (results are directly returned via Task.FromResult) and the HttpClient is faked as well (await a random Task.Delay between 0-50 ms and return a response), so the only real I/O is the dequeuing from MSMQ.
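Roughly, the fakes look like this (the class and method names below are just illustrative, not my actual code):

public class FakeDatabase
{
    // The "DB call" completes synchronously with canned data: no real I/O.
    public Task<string> GetDataAsync(int id)
    {
        return Task.FromResult("fake-row-" + id);
    }
}

public class FakeHttpSender
{
    private readonly Random random = new Random();
    private readonly object randomLock = new object();

    // Simulate the outbound HTTP call with a random 0-50 ms delay.
    public async Task<int> SendAsync(string payload)
    {
        int delay;
        lock (this.randomLock) // Random is not thread-safe, so guard it
        {
            delay = this.random.Next(0, 51);
        }

        await Task.Delay(delay);
        return 200; // fake HTTP status code
    }
}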

I have already improved the throughput of the application a lot: I noticed that a lot of time was spent in GC, so I used CLR Profiler to find out where I could optimize things.

I am now trying to see if I can still improve throughput, and I think it may be possible.

There are two things I don't understand, and maybe there is some throughput improvement potential behind them:

1) I have 4 CPU cores (well, in fact just 2 physical ones ... an i7 CPU), and when the application runs, it only uses 3 CPU cores at most (in the VS2012 concurrency visualizer I can clearly see that only 3 cores are being used, and in Windows perfmon I can see CPU usage peaking at ~75/80%). Any idea why? I have no control over the threads as I am not explicitly creating them, only relying on Tasks, so why doesn't the Task scheduler maximize CPU usage in my case? Has anyone experienced this?

2) Using the VS2012 concurrency visualizer I can see a very high synchronization time (approx. 20% execution and 80% synchronization). FYI, approx. 15 threads are being created.

Approx. 60% of the synchronization comes from the following call stacks:

clr.dll!ThreadPoolMgr::WorkerThreadStart
clr.dll!CLRSemaphore::Wait
kernelbase.dll!WaitForSingleObjectEx

and

clr.dll!ThreadPoolMgr::WorkerThreadStart
clr.dll!ThreadPoolMgr::UnfairSemaphore::Wait
clr.dll!CLRSemaphore::Wait 
kernelbase.dll!WaitForSingleObjectEx

And approx. 30% of the synchronization comes from:

clr.dll!ThreadPoolMgr::CompletionPortThreadStart
kernel32.dll!GetQueueCompletionStatusStub
kernelbase.dll!GetQueuedCompletionStatus
ntdll.dll!ZwRemoveIoCompletion 
..... blablabla 
ntoskrnl.exe!KeRemoveQueueEx

I don't know whether it is normal to experience such high synchronization or not.

EDIT: Based on Stephen's answer, I am adding more details about my implementation:

Indeed, my server is completely asynchronous. However, some CPU work is done to process each message (not that much, I admit, but still some). After a message is received from the MSMQ queue, it is first deserialized (most of the CPU/memory cost seems to happen at this point), then it passes through various stages of processing/validation which cost some CPU, before finally reaching the "end of the pipe" where the processed message is sent to the outside world via an HttpClient.

My implementation does not wait for a message to be fully processed before dequeuing the next one from the queue. Indeed, my message pump, which dequeues messages from the queue, is very simple and immediately "forwards" each message so it can dequeue the next one. The simplified code looks like this (omitting exception management, cancellation, ...):

while (true)
{
    var message = await this.queue.ReceiveNextMessageAsync();
    this.DeserializeDispatchMessageAsync(message);
}

private async void DeserializeDispatchMessageAsync(Message message)
{
    // Immediately yield to avoid blocking the asynchronous message pump
    // while deserializing the body, which would otherwise impact the throughput.
    await Task.Yield();

    this.messageDispatcher.DispatchAsync(message).ForgetSafely();
}

ReceiveNextMessageAsync is a custom method using a TaskCompletionSource, as the .NET MessageQueue class did not offer any async methods in .NET Framework 4.5. So I am just wrapping the BeginReceive/EndReceive pair with a TaskCompletionSource.
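Simplified, the wrapper looks roughly like this (again omitting exception/cancellation details; the body below is a sketch, not my exact code):

public Task<Message> ReceiveNextMessageAsync()
{
    var tcs = new TaskCompletionSource<Message>();

    this.queue.BeginReceive(MessageQueue.InfiniteTimeout, null, asyncResult =>
    {
        try
        {
            // Complete the task with the dequeued message.
            tcs.SetResult(this.queue.EndReceive(asyncResult));
        }
        catch (Exception ex)
        {
            tcs.SetException(ex);
        }
    });

    return tcs.Task;
}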

This is one of the only places in my code where I don't await an async method. The loop dequeues as fast as it can. It does not even wait for message deserialization (deserialization is done lazily by the .NET FCL implementation of Message, when the Body property is explicitly accessed). I do a Task.Yield() immediately to fork the deserialization/message processing to another task and free the loop right away.

Right now, in the context of my benchmarks, as I said previously, all I/Os (DB access only) are faked. All calls to async methods that get data from the DB just return a Task.FromResult with fake data. There are around 20 DB calls during the processing of a message and they are all faked / synchronous right now. The only asynchrony point is at the end of the processing of a message, where it gets sent via HttpClient. The HttpClient sending is faked as well, but I do a random (0-50 ms) "await Task.Delay" at that point. Anyway, due to the faking of the DB, the processing of each message can be seen as a single Task.

For my benchmarks I store approx. 300K messages in the queue, then I launch the server app. It dequeues quite fast, flooding the server app, and all messages are processed concurrently. That's why I don't understand why I do not reach 100% CPU usage on 4 cores, but only 75% with 3 cores used (synchronization concerns aside).

When I only dequeue, without doing any deserialization or processing of the messages (commenting out the call to DeserializeDispatchMessageAsync), I reach a throughput of approx. 20K messages/sec. When I do the whole processing, I reach a throughput of approx. 10K messages/sec.

The fact that messages are dequeued quickly from the queue and that deserialization + processing of each message is done in a separate task makes me picture a lot of Tasks (one per message) being queued on the task scheduler (the thread pool here ... no synchronization context), so I would expect the thread pool to dispatch all these tasks across the maximum number of cores, keeping all 4 cores fully busy processing them, but it doesn't seem to work that way.

Anyway, any answer is welcome, I am looking for any idea/tips.

asked Jul 26 '13 by darkey

1 Answer

It sounds like your server is almost completely asynchronous (async MSMQ, async DB, async HttpClient). So in that case I don't find your results surprising.

First, there is very little CPU work to do. I'd fully expect each of the thread pool threads to sit around most of the time waiting for work to do. Remember that no CPU is used during a naturally-asynchronous operation.

The Task returned by an asynchronous MSMQ/DB/HttpClient operation does not execute on a thread pool thread; it just represents the completion of an I/O operation. The only thread pool work you're seeing consists of the brief bits of synchronous work inside the asynchronous methods, which usually just arrange the buffers for I/O.

As far as throughput goes, you do have some room to scale (assuming that your test was flooding your existing service). It may be that your code is just (asynchronously) retrieving a single value from MSMQ and then (asynchronously) processing it before retrieving another value; in that case, you'd definitely see improvement from continuously reading from the MSMQ. Remember that async code is asynchronous but it is still serialized; your async method may pause at any await.

If that's the case, you may benefit from setting up a TPL Dataflow pipeline (with MaxDegreeOfParallelism set to Unbounded) and running a tight loop that asynchronously reads from MSMQ and shoves the data into the pipeline. That would be easier than doing your own overlapping processing.
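Such a pipeline could look roughly like this (ProcessAsync and SendAsync are placeholders for your deserialization/validation steps and the final HttpClient call):

var processingBlock = new ActionBlock<Message>(
    async message =>
    {
        var processed = await ProcessAsync(message); // deserialize + validate
        await SendAsync(processed);                  // HttpClient at the end of the pipe
    },
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded
    });

// Tight dequeuing loop: read as fast as possible and hand each message to the block.
while (true)
{
    var message = await queue.ReceiveNextMessageAsync();
    processingBlock.Post(message);
}

The ActionBlock takes care of dispatching the per-message work to the thread pool, so the dequeuing loop never waits on processing.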

Update for edit:

I have a handful of suggestions:

  1. Use Task.Run instead of await Task.Yield. Task.Run has clearer intent.
  2. Your Begin/End wrappers can use Task.Factory.FromAsync instead of a TCS, which gives you cleaner code.
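Rough sketches of both suggestions (the field and method names follow your code; everything else is just illustrative):

// 1. Explicitly queue the per-message work to the thread pool instead of
//    relying on await Task.Yield() inside an async void method.
private void DeserializeDispatchMessage(Message message)
{
    Task.Run(() => this.messageDispatcher.DispatchAsync(message))
        .ForgetSafely();
}

// 2. Wrap BeginReceive/EndReceive with Task.Factory.FromAsync instead of a TCS.
public Task<Message> ReceiveNextMessageAsync()
{
    return Task.Factory.FromAsync<Message>(
        (callback, state) => this.queue.BeginReceive(MessageQueue.InfiniteTimeout, state, callback),
        this.queue.EndReceive,
        null);
}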

But I don't see any reason why the last core wouldn't be used - barring the obvious reasons like the profiler or another app keeping it busy. What you should end up with is an async equivalent of dynamic parallelism, which is one of the situations the .NET thread pool was specifically designed to handle.

answered Oct 28 '22 by Stephen Cleary