My application searches many files in parallel using regex: await Task.WhenAll(filePaths.Select(FindThings));
Inside FindThings, it spends most of its time performing the regex search, as these files can be hundreds of MB in size.
static async Task FindThings(string path) {
    string fileContent = null;
    try
    {
        using (var reader = File.OpenText(path))
            fileContent = await reader.ReadToEndAsync();
    }
    catch (Exception e)
    {
        WriteLine(lineIndex, "{0}: Error {1}", filename, e);
        return;
    }
    var exitMatches = _exitExp.Matches(fileContent);
    foreach (Match exit in exitMatches)
    {
        if (_taskDelay > 0)
            await Task.Delay(_taskDelay);
        // [...]
I'm getting a lot of responses that indicate I didn't clarify why this is important. Take this example program (which uses the Nito.AsyncEx library):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Nito.AsyncEx;

namespace Scrap
{
    class Program
    {
        static void Main(string[] args)
        {
            AsyncContext.Run(() => MainAsync(args));
        }

        static async void MainAsync(string[] args)
        {
            var tasks = new List<Task>();
            var asyncStart = DateTime.Now;
            tasks.Add(Task.WhenAll(Enumerable.Range(0, 10).Select(i =>
                ShowIndexAsync(i, asyncStart))));
            var start = DateTime.Now;
            tasks.Add(Task.WhenAll(Enumerable.Range(0, 10).Select(i =>
                ShowIndex(i, start))));
            await Task.WhenAll(tasks);
            Console.ReadLine();
        }

        static async Task ShowIndexAsync(int index, DateTime start)
        {
            Console.WriteLine("ShowIndexAsync: {0} ({1})",
                index, DateTime.Now - start);
            await Task.Delay(index * 100);
            Console.WriteLine("!ShowIndexAsync: {0} ({1})",
                index, DateTime.Now - start);
        }

        static Task ShowIndex(int index, DateTime start)
        {
            return Task.Factory.StartNew(() => {
                Console.WriteLine("ShowIndex: {0} ({1})",
                    index, DateTime.Now - start);
                Task.Delay(index * 100).Wait();
                Console.WriteLine("!ShowIndex: {0} ({1})",
                    index, DateTime.Now - start);
            });
        }
    }
}
So this calls ShowIndexAsync 10 times, then ShowIndex 10 times, and waits for them all to finish. ShowIndexAsync is "async to the core" while ShowIndex is not, but they both operate on tasks. The blocking operation here is Task.Delay; the difference is that one awaits that task, while the other calls .Wait() on it inside a task.
You'd expect the first ones to be queued (ShowIndexAsync) to finish first, but you'd be incorrect:
ShowIndexAsync: 0 (00:00:00.0060000)
!ShowIndexAsync: 0 (00:00:00.0070000)
ShowIndexAsync: 1 (00:00:00.0080000)
ShowIndexAsync: 2 (00:00:00.0110000)
ShowIndexAsync: 3 (00:00:00.0110000)
ShowIndexAsync: 4 (00:00:00.0120000)
ShowIndexAsync: 5 (00:00:00.0130000)
ShowIndexAsync: 6 (00:00:00.0130000)
ShowIndexAsync: 7 (00:00:00.0140000)
ShowIndexAsync: 8 (00:00:00.0150000)
ShowIndexAsync: 9 (00:00:00.0150000)
ShowIndex: 0 (00:00:00.0020000)
!ShowIndex: 0 (00:00:00.0020000)
ShowIndex: 1 (00:00:00.0030000)
!ShowIndex: 1 (00:00:00.1100000)
ShowIndex: 2 (00:00:00.1100000)
!ShowIndex: 2 (00:00:00.3200000)
ShowIndex: 3 (00:00:00.3200000)
!ShowIndex: 3 (00:00:00.6220000)
ShowIndex: 4 (00:00:00.6220000)
!ShowIndex: 4 (00:00:01.0280000)
ShowIndex: 5 (00:00:01.0280000)
!ShowIndex: 5 (00:00:01.5420000)
ShowIndex: 6 (00:00:01.5420000)
!ShowIndex: 6 (00:00:02.1500000)
ShowIndex: 7 (00:00:02.1510000)
!ShowIndex: 7 (00:00:02.8650000)
ShowIndex: 8 (00:00:02.8650000)
!ShowIndex: 8 (00:00:03.6660000)
ShowIndex: 9 (00:00:03.6660000)
!ShowIndex: 9 (00:00:04.5780000)
!ShowIndexAsync: 1 (00:00:04.5950000)
!ShowIndexAsync: 2 (00:00:04.5960000)
!ShowIndexAsync: 3 (00:00:04.5970000)
!ShowIndexAsync: 4 (00:00:04.5970000)
!ShowIndexAsync: 5 (00:00:04.5980000)
!ShowIndexAsync: 6 (00:00:04.5990000)
!ShowIndexAsync: 7 (00:00:04.5990000)
!ShowIndexAsync: 8 (00:00:04.6000000)
!ShowIndexAsync: 9 (00:00:04.6010000)
Why did that happen?
The task scheduler is only going to use so many real threads. "await" compiles to a cooperative multi-tasking state machine. If you have a blocking operation that is not awaited (Task.Delay(...).Wait() in this example, the Regex matching in my question), it's not going to cooperate and let the task scheduler properly manage tasks.
If we change our sample program to:
static async void MainAsync(string[] args)
{
    var asyncStart = DateTime.Now;
    await Task.WhenAll(Enumerable.Range(0, 10).Select(i =>
        ShowIndexAsync(i, asyncStart)));
    var start = DateTime.Now;
    await Task.WhenAll(Enumerable.Range(0, 10).Select(i =>
        ShowIndex(i, start)));
    Console.ReadLine();
}
Then our output changes to:
ShowIndexAsync: 0 (00:00:00.0050000)
!ShowIndexAsync: 0 (00:00:00.0050000)
ShowIndexAsync: 1 (00:00:00.0060000)
ShowIndexAsync: 2 (00:00:00.0080000)
ShowIndexAsync: 3 (00:00:00.0090000)
ShowIndexAsync: 4 (00:00:00.0090000)
ShowIndexAsync: 5 (00:00:00.0100000)
ShowIndexAsync: 6 (00:00:00.0110000)
ShowIndexAsync: 7 (00:00:00.0110000)
ShowIndexAsync: 8 (00:00:00.0120000)
ShowIndexAsync: 9 (00:00:00.0120000)
!ShowIndexAsync: 1 (00:00:00.1150000)
!ShowIndexAsync: 2 (00:00:00.2180000)
!ShowIndexAsync: 3 (00:00:00.3160000)
!ShowIndexAsync: 4 (00:00:00.4140000)
!ShowIndexAsync: 5 (00:00:00.5190000)
!ShowIndexAsync: 6 (00:00:00.6130000)
!ShowIndexAsync: 7 (00:00:00.7190000)
!ShowIndexAsync: 8 (00:00:00.8170000)
!ShowIndexAsync: 9 (00:00:00.9170000)
ShowIndex: 0 (00:00:00.0030000)
!ShowIndex: 0 (00:00:00.0040000)
ShowIndex: 3 (00:00:00.0060000)
ShowIndex: 4 (00:00:00.0090000)
ShowIndex: 2 (00:00:00.0100000)
ShowIndex: 1 (00:00:00.0100000)
ShowIndex: 5 (00:00:00.0130000)
ShowIndex: 6 (00:00:00.0130000)
ShowIndex: 7 (00:00:00.0150000)
ShowIndex: 8 (00:00:00.0180000)
!ShowIndex: 7 (00:00:00.7660000)
!ShowIndex: 6 (00:00:00.7660000)
ShowIndex: 9 (00:00:00.7660000)
!ShowIndex: 2 (00:00:00.7660000)
!ShowIndex: 5 (00:00:00.7660000)
!ShowIndex: 4 (00:00:00.7660000)
!ShowIndex: 3 (00:00:00.7660000)
!ShowIndex: 1 (00:00:00.7660000)
!ShowIndex: 8 (00:00:00.8210000)
!ShowIndex: 9 (00:00:01.6700000)
Notice how the async calls have a nice, even distribution of end times but the non-async code does not. The task scheduler is getting blocked: it won't create additional real threads, because it's expecting cooperation.
I don't expect it to take less CPU time or the like, but my goal is to make FindThings multi-task in a cooperative manner, i.e., make it "async to the core."
Regex searches are a CPU-bound operation, so they're going to take time. You can use Task.Run to push the work off to a background thread and thus keep your UI responsive, but it won't help them go any faster.
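As a rough sketch of the Task.Run approach: the pattern, class name, and method name below are made up for illustration and stand in for the question's _exitExp and FindThings. One thing to watch is that Regex.Matches is evaluated lazily, so the sketch reads Count inside the delegate to make sure the matching actually happens on the thread-pool thread.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class RegexOffload
{
    // Hypothetical pattern; substitute the real expression.
    static readonly Regex ExitExp = new Regex(@"exit\s+\d+", RegexOptions.Compiled);

    // Runs the CPU-bound match on a thread-pool thread.
    // Reading Count forces the lazy MatchCollection to be fully
    // evaluated inside the delegate rather than on the caller's thread.
    public static Task<int> CountMatchesAsync(string content)
    {
        return Task.Run(() => ExitExp.Matches(content).Count);
    }

    static void Main()
    {
        int count = CountMatchesAsync("exit 1\nnoise\nexit 22")
            .GetAwaiter().GetResult();
        Console.WriteLine(count);
    }
}
```

This only moves the work; the total CPU time spent matching stays the same.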
Since your searches are already in parallel, there's probably nothing more you can do. You could try using asynchronous file reads to reduce the number of blocked threads in your thread pool, but it probably won't have a huge effect.
Your current code is calling ReadToEndAsync, but it needs to open the file for asynchronous access (i.e., use the FileStream constructor and explicitly ask for an asynchronous file handle by passing true for the useAsync parameter or FileOptions.Asynchronous for the options parameter).
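A minimal sketch of what that could look like, using the FileOptions.Asynchronous overload. The class and method names, the 4096-byte buffer size, and the UTF-8 encoding are assumptions for the example, not something taken from your code:

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

static class AsyncFileRead
{
    // Opens the file with an asynchronous handle before reading it,
    // so ReadToEndAsync is backed by asynchronous I/O.
    public static async Task<string> ReadAllTextAsync(string path)
    {
        using (var stream = new FileStream(
            path, FileMode.Open, FileAccess.Read, FileShare.Read,
            4096, FileOptions.Asynchronous))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return await reader.ReadToEndAsync();
        }
    }

    static void Main()
    {
        // Round-trip through a temporary file to exercise the helper.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "hello");
            Console.WriteLine(ReadAllTextAsync(path).GetAwaiter().GetResult());
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

In FindThings, this would replace the File.OpenText call; the regex search afterwards is still CPU-bound.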