I'm profiling our C# .NET application, and I'm noticing that the method System.Runtime.CompilerServices.AsyncTaskMethodBuilder::Start is appearing multiple times, taking up about 3-4 seconds of self time in my 1-minute sample (meaning that about 3-4 seconds are spent in the task infrastructure).
I understand that this method is used by the compiler to implement the async/await language construct in C#. What, generally, is in it that would cause it to block or otherwise take up a lot of time? Is there any way to improve our method so that it spends less time in this infrastructure?
EDIT: Here's a somewhat lengthy, but still self-contained, code sample demonstrating the problem, essentially doing a parallel merge sort on two very large arrays:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace AsyncAwaitSelfTimeTest
{
    class Program
    {
        static void Main(string[] args)
        {
            Random random = new Random();
            int[] arrayOne = GenerateArray(50_000_000, random.Next);
            double[] arrayTwo = GenerateArray(50_000_000, random.NextDouble);
            Comparer<int> comparerOne = Comparer<int>.Create((a, b) =>
            {
                if (a < b) return -1;
                else if (a > b) return 1;
                else return 0;
            });
            Comparer<double> comparerTwo = Comparer<double>.Create((a, b) =>
            {
                if (a < b) return -1;
                else if (a > b) return 1;
                else return 0;
            });
            var sortTaskOne = Task.Run(() => MergeSort(arrayOne, 0, arrayOne.Length, comparerOne));
            var sortTaskTwo = Task.Run(() => MergeSort(arrayTwo, 0, arrayTwo.Length, comparerTwo));
            Task.WaitAll(sortTaskOne, sortTaskTwo);
            Console.Write("done sorting");
        }

        static T[] GenerateArray<T>(int length, Func<T> getFunc)
        {
            T[] result = new T[length];
            for (int i = 0; i < length; i++)
            {
                result[i] = getFunc();
            }
            return result;
        }

        static async Task MergeSort<T>(T[] array, int start, int end, Comparer<T> comparer)
        {
            if (end - start <= 16)
            {
                SelectionSort(array, start, end, comparer);
            }
            else
            {
                int mid = start + (end - start) / 2;
                Task firstTask = Task.Run(() => MergeSort(array, start, mid, comparer));
                Task secondTask = Task.Run(() => MergeSort(array, mid, end, comparer));
                await Task.WhenAll(firstTask, secondTask);
                int firstIndex = start;
                int secondIndex = mid;
                T[] dest = new T[end - start];
                for (int i = 0; i < dest.Length; i++)
                {
                    if (firstIndex >= mid)
                    {
                        dest[i] = array[secondIndex++];
                    }
                    else if (secondIndex >= end)
                    {
                        dest[i] = array[firstIndex++];
                    }
                    else if (comparer.Compare(array[firstIndex], array[secondIndex]) < 0)
                    {
                        dest[i] = array[firstIndex++];
                    }
                    else
                    {
                        dest[i] = array[secondIndex++];
                    }
                }
                dest.CopyTo(array, start);
            }
        }

        static void SelectionSort<T>(T[] array, int start, int end, Comparer<T> comparer)
        {
            // note: using selection sort here to prevent time variability
            for (int i = start; i < end; i++)
            {
                int minIndex = i;
                for (int j = i + 1; j < end; j++)
                {
                    if (comparer.Compare(array[j], array[minIndex]) < 0)
                    {
                        minIndex = j;
                    }
                }
                T temp = array[i];
                array[i] = array[minIndex];
                array[minIndex] = temp;
            }
        }
    }
}
In the performance profile of this code, the two copies of System.Runtime.CompilerServices.AsyncTaskMethodBuilder::Start (one for each generic value type) take up most of the self-processor time, with the two MergeSort methods only taking up a very small fraction of it. Similar behavior was also noted when Task.Run was not used (thus utilizing only a single processor).
EDIT 2: Thank you very much for your answers so far. I originally thought that the fact that Task&lt;TResult&gt; was being used was part of the problem (since it was being used in the original code), hence my structure of copying arrays instead of sorting in place. However, I recognize now that that is irrelevant, so I changed the code snippet above to do a merge sort in place instead. I have also reduced the task count by introducing a non-trivial sequential cutoff (doing a selection sort for the sake of tightly bounded time), as well as using Comparer objects to prevent boxing allocations of the array elements (and thus reducing the profiling interference caused by the garbage collector).
However, the same pattern, that of AsyncTaskMethodBuilder::Start taking significant self time, still exists and can still be found in the profiling results.
EDIT 3: To clarify, the answer I am/was looking for is not "why is this code slow?", but "why is the .NET profiler telling me that most of the cost is spent inside a method I have no control over?" The accepted answer helped me identify the problem, which was that most of the logic was inside a generated type that the profiler does not include.
The problem you have here is that you spawn so many tasks that you overload the normal thread pool, causing .NET to spin up additional threads. You keep creating new tasks all the way down to an array length of 1 (for a 50,000,000-element array, that is on the order of 100 million Task instances). AsyncTaskMethodBuilder::Start starts being a significant time consumer once it needs to create new tasks to continue the execution and can't reuse tasks from the pool.
There are a couple of things you need to change to get some performance into your function:
First: Clean up your awaits.
Task<T[]> firstTask = Task.Run(() => MergeSort(firstHalf));
Task<T[]> secondTask = Task.Run(() => MergeSort(secondHalf));
await Task.WhenAll(firstTask, secondTask);
T[] firstDest = await firstTask;
T[] secondDest = await secondTask;
This is already a problem. Remember that every await counts: each one adds another suspension point to the compiler-generated state machine, and whenever the awaited Task is not already complete, the method is suspended there and the rest of it is resumed as a continuation on another task. This switch takes time. Not much, but it happens so often in your function that it becomes measurable.
Task.WhenAll already returns the result values you need:
Task<T[]> firstTask = Task.Run(() => MergeSort(firstHalf));
Task<T[]> secondTask = Task.Run(() => MergeSort(secondHalf));
T[][] dests = await Task.WhenAll(firstTask, secondTask);
T[] firstDest = dests[0];
T[] secondDest = dests[1];
This way you can reduce the number of task switches in your function.
Second: Reduce the number of Task instances created.
Tasks are a great tool for distributing work across your CPU cores, but you have to make sure that they are kept busy. There is a cost to creating a new Task, and you have to be sure that it is worth it.
I suggest adding a threshold below which no new Tasks are created. If the portions you are working on get too small, you shouldn't create new Task instances, but instead call the functions directly.
For example:
T[] firstDest;
T[] secondDest;
if (mid > 100)
{
    Task<T[]> firstTask = Task.Run(() => MergeSort(firstHalf));
    Task<T[]> secondTask = Task.Run(() => MergeSort(secondHalf));
    T[][] dests = await Task.WhenAll(firstTask, secondTask);
    firstDest = dests[0];
    secondDest = dests[1];
}
else
{
    firstDest = await MergeSort(firstHalf);
    secondDest = await MergeSort(secondHalf);
}
You should try different values and see how this changes things. 100 is just a value I'd start with, but you can choose any other. This will reduce the number of tasks that don't have much to do by a significant amount. Basically, the value determines how much remaining work is acceptable for a single task to handle.
Lastly, you should think about handling your array instances differently. If you tell your functions the start position and the length of the portion of the array they are expected to work on, you should be able to improve the performance even more, because you don't have to copy the arrays thousands of times.
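For illustration, here is a minimal sketch of that index-based approach (my own code, not part of the original answer; it also folds in the threshold idea from above, and the updated code in the question already works along the same lines):
// Illustrative sketch only: sorts array[start .. start+length) in place,
// passing index ranges instead of copying sub-arrays.
static async Task MergeSort<T>(T[] array, int start, int length, Comparer<T> comparer)
{
    if (length <= 1) return;

    int mid = length / 2;
    if (length > 100)
    {
        // Large enough range: split the work across two tasks.
        await Task.WhenAll(
            Task.Run(() => MergeSort(array, start, mid, comparer)),
            Task.Run(() => MergeSort(array, start + mid, length - mid, comparer)));
    }
    else
    {
        // Small range: recurse directly to avoid the cost of extra tasks.
        await MergeSort(array, start, mid, comparer);
        await MergeSort(array, start + mid, length - mid, comparer);
    }

    // Merge the two sorted halves through one temporary buffer, then copy back.
    T[] dest = new T[length];
    int i = start, j = start + mid;
    for (int k = 0; k < length; k++)
    {
        if (i >= start + mid)
            dest[k] = array[j++];
        else if (j >= start + length)
            dest[k] = array[i++];
        else if (comparer.Compare(array[i], array[j]) <= 0)
            dest[k] = array[i++];
        else
            dest[k] = array[j++];
    }
    dest.CopyTo(array, start);
}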
Interesting. I took your sample and altered the MergeSort async method to be non-async. Now a profiling session takes ~33 sec to complete (vs ~36 sec for the async version, both using Release config). The non-async version looks like this:
static Task<T[]> MergeSort<T>(T[] src) where T : IComparable<T>
{
    if (src.Length <= 1)
    {
        return Task.FromResult(src);
    }
    else
    {
        int mid = src.Length / 2;
        T[] firstHalf = new T[mid];
        T[] secondHalf = new T[src.Length - mid];
        Array.Copy(src, 0, firstHalf, 0, mid);
        Array.Copy(src, mid, secondHalf, 0, src.Length - mid);
        Task<T[]> firstTask = Task.Run(() => MergeSort(firstHalf));
        Task<T[]> secondTask = Task.Run(() => MergeSort(secondHalf));
        return Task.WhenAll(firstTask, secondTask).ContinueWith(
            continuationFunction: _ =>
            {
                T[] firstDest = firstTask.Result;
                T[] secondDest = secondTask.Result;
                int firstIndex = 0;
                int secondIndex = 0;
                T[] dest = new T[src.Length];
                for (int i = 0; i < dest.Length; i++)
                {
                    if (firstIndex >= firstDest.Length)
                    {
                        dest[i] = secondDest[secondIndex++];
                    }
                    else if (secondIndex >= secondDest.Length)
                    {
                        dest[i] = firstDest[firstIndex++];
                    }
                    else if (firstDest[firstIndex].CompareTo(secondDest[secondIndex]) < 0)
                    {
                        dest[i] = firstDest[firstIndex++];
                    }
                    else
                    {
                        dest[i] = secondDest[secondIndex++];
                    }
                }
                return dest;
            },
            cancellationToken: System.Threading.CancellationToken.None,
            continuationOptions: TaskContinuationOptions.ExecuteSynchronously,
            scheduler: TaskScheduler.Default);
    }
}
So, for this particular example, the async/await overhead seems to be ~3 seconds. This is more than I expected, but it's certainly not a bottleneck here.
Regarding this observation:
In the performance profile of this code, the two copies of System.Runtime.CompilerServices.AsyncTaskMethodBuilder::Start (one for each generic value type) take up most of the self-processor time, with the two MergeSort methods only taking up a very small fraction of the self-processor time.
I haven't analyzed the compiler-generated code for this particular async method, but I suspect MergeSort just contains short prolog/epilog code, while the actual CPU-intensive work is indirectly invoked by AsyncTaskMethodBuilder::Start.
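For context, here is a rough, simplified sketch of what the compiler turns such an async method into (the type and member names below are illustrative, not the actual generated identifiers, and a lot of bookkeeping is omitted):
// Requires: using System.Runtime.CompilerServices; using System.Threading.Tasks;
static Task MergeSort<T>(T[] array, int start, int end, Comparer<T> comparer)
{
    // The method you wrote becomes a small stub that builds the state machine
    // and kicks it off.
    var sm = new MergeSortStateMachine<T>
    {
        array = array,
        start = start,
        end = end,
        comparer = comparer,
        builder = AsyncTaskMethodBuilder.Create(),
        state = -1
    };
    // Start runs sm.MoveNext() synchronously, up to the first await whose task
    // is not already complete. All user code executed before that point runs
    // inside this call, so a sampling profiler attributes that CPU time to
    // AsyncTaskMethodBuilder::Start rather than to MergeSort itself.
    sm.builder.Start(ref sm);
    return sm.builder.Task;
}

// The original method body is moved into MoveNext() of a compiler-generated
// state machine type (one instantiation per value type argument, which is why
// two copies show up in the profile).
struct MergeSortStateMachine<T> : IAsyncStateMachine
{
    public T[] array;
    public int start, end;
    public Comparer<T> comparer;
    public AsyncTaskMethodBuilder builder;
    public int state;

    public void MoveNext()
    {
        // ...the body of MergeSort, rewritten as a resumable state machine
        // (selection sort, task spawning, merging), lives here...
    }

    public void SetStateMachine(IAsyncStateMachine stateMachine)
        => builder.SetStateMachine(stateMachine);
}
So the time really is being spent in the sort itself; the profiler just shows it under the builder's Start frame because that is where the generated MoveNext gets invoked.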