 

Async method returning a completed task unexpectedly slow

I have some C# code that runs fine on a web server. The code uses async/await because it performs some network calls in the production environment.

I also need to run some simulations on the code; the code gets called billions of times concurrently during a simulation. The simulations don't perform any network calls: a mock is used which returns a value using Task.FromResult(). The values returned from the mock simulate every possible response to the network call that can be received in the production environment.
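
A minimal sketch of such a mock (the interface and member names here are hypothetical, not taken from my code base):

```csharp
using System.Threading.Tasks;

public interface INetworkClient
{
    Task<double> GetValueAsync(int input);
}

// Mock used in the simulations: no I/O is performed,
// an already-completed task is returned immediately.
public class MockNetworkClient : INetworkClient
{
    public Task<double> GetValueAsync(int input)
        => Task.FromResult((double)input); // simulated network response
}
```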

I understand there is some overhead in using async/await, but I would expect no huge difference in performance, given that an already-completed task is returned and there is no actual waiting.

But when I ran some tests I noticed a big drop in performance (especially on some hardware).

I tested the following code using LINQPad with compiler optimization turned on; you can remove the .Dump() call and paste the code into a console application if you want to test it directly in Visual Studio.

// SYNC VERSION

void Main()
{
    Enumerable.Range(0, 1_000_000_000)
        .AsParallel()
        .Aggregate(
            () => 0.0,
            (a, i) => Calc(a, i),
            (a1, a2) => a1 + a2,
            f => f
        )
        .Dump();
}

double Calc(double a, double i) => a + Math.Sin(i);

and

// ASYNC-AWAIT VERSION

void Main()
{
    Enumerable.Range(0, 1_000_000_000)
        .AsParallel()
        .Aggregate(
            () => 0.0,
            (a, i) => Calc(a, i).Result,
            (a1, a2) => a1 + a2,
            f => f
        )
        .Dump();
}


async Task<double> Calc(double a, double i) => a + Math.Sin(i);

The async-await version of the code exemplifies the situation of my simulation code.

I run the simulations quite successfully on my i7 machine, but I get very bad results when I run the code on an AMD Threadripper machine we have in our office.

I've run some benchmarks using the code above in LINQPad, both on my i7 machine and on the AMD Threadripper, and these are the results:

TEST on i7 quad-core 3.67 GHz (Windows 10 Pro x64):

sync version: 15 sec (100% CPU)
async-await version: 20 sec (93% CPU)

TEST on AMD 32-core 3.00 GHz (Windows Server 2019 x64):

sync version: 16 sec (50% CPU)
async-await version: 140 sec (14% CPU)

I understand there are hardware differences (maybe the Intel hyper-threading is better, etc.), but this question is not about hardware performance.

Why isn't the CPU usage always 100% (or 50%, taking into account the worst case for hyper-threading)? Why is there a drop in CPU usage in the async-await version of the code?

(The drop in CPU usage is sharper on the AMD, but it's also present on the Intel.)

Is there any workaround which doesn't involve refactoring the whole async-await chain of calls all around the code? (The code base is big and complicated.)

Thank you.

EDIT

As suggested in a comment, I tried to use ValueTask instead of Task and it seems to solve the issue. I tried this directly in Visual Studio because I needed a NuGet package (Release build), and these are the results:

TEST on i7

"sync" version: 16 sec (100% CPU)
"await Task" version: 49 sec (95% CPU)
"await ValueTask" version: 31 sec (100% CPU)

and

TEST on AMD

"sync" version: 15 sec (50% CPU)
"await Task" version: 125 sec (12% CPU)
"await ValueTask" version: 17 sec (50% CPU)

Honestly I don't know much about the ValueTask class and I'm going to study it. If you can explain/elaborate on it in an answer, that would be welcome.

Thank you.

Valerio Natangelo asked Mar 03 '23


1 Answer

Your garbage collector is most probably configured in workstation mode (the default), which uses a single thread to reclaim the memory allocated by unused objects. For a machine with 32 cores, one core will certainly not be enough to clean up the mess that the other 31 cores are constantly producing! So you should probably switch to server mode:

<configuration>
  <runtime>
    <gcServer enabled="true"></gcServer>
  </runtime>
</configuration>
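
(The App.config element above applies to .NET Framework. On .NET Core / .NET 5+ the equivalent setting goes in the project file instead:)

```xml
<!-- .csproj equivalent for .NET Core / .NET 5+ -->
<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>
```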

Background server garbage collection uses multiple threads, typically a dedicated thread for each logical processor.

By using ValueTasks instead of Tasks you avoid heap allocations, because a ValueTask is a struct that lives on the stack and needs no garbage collection. But this is the case only when it wraps the result of an already-completed operation; if it wraps an incomplete task it offers no advantage. It is suitable for cases where you have to await tens of millions of tasks and you expect that the vast majority of them will complete synchronously.
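
For example, the Calc method from the question only needs its return type changed; callers can still await it the same way (note that on .NET Framework this requires the System.Threading.Tasks.Extensions NuGet package, which matches what you observed):

```csharp
using System;
using System.Threading.Tasks;

// Returns a ValueTask<double>: when the method completes synchronously,
// the result is stored in the struct itself and nothing is heap-allocated,
// so the garbage collector has no work to do per call.
async ValueTask<double> Calc(double a, double i) => a + Math.Sin(i);
```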

Theodor Zoulias answered Mar 15 '23