Do async regexes exist in C# and would they help my situation?

My application searches many files in parallel using regex: await Task.WhenAll(filePaths.Select(FindThings));

FindThings spends most of its time performing the regex search, as these files can be hundreds of MB in size.

static async Task FindThings(string path) {
    string fileContent = null;
    try
    {
        using (var reader = File.OpenText(path))
            fileContent = await reader.ReadToEndAsync();
    }
    catch (Exception e)
    {
        WriteLine(lineIndex, "{0}: Error {1}", filename, e);
        return;
    }

    var exitMatches = _exitExp.Matches(fileContent);

    foreach (Match exit in exitMatches)
    {
        if (_taskDelay > 0)
            await Task.Delay(_taskDelay);

        // [...]

  • Is there an async version of Regex or any way to make this properly cooperative with Tasks?

Why this is important

I'm getting a lot of responses that indicate I didn't clarify why this is important. Take this example program (which uses the Nito.AsyncEx library):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Nito.AsyncEx;

namespace Scrap
{
    class Program
    {
        static void Main(string[] args)
        {
            AsyncContext.Run(() => MainAsync(args));
        }

        static async void MainAsync(string[] args)
        {
            var tasks = new List<Task>();

            var asyncStart = DateTime.Now;
            tasks.Add(Task.WhenAll(Enumerable.Range(0, 10).Select(i =>
                ShowIndexAsync(i, asyncStart))));

            var start = DateTime.Now;
            tasks.Add(Task.WhenAll(Enumerable.Range(0, 10).Select(i =>
                ShowIndex(i, start))));

            await Task.WhenAll(tasks);

            Console.ReadLine();
        }


        static async Task ShowIndexAsync(int index, DateTime start)
        {
            Console.WriteLine("ShowIndexAsync: {0} ({1})",
                index, DateTime.Now - start);
            await Task.Delay(index * 100);
            Console.WriteLine("!ShowIndexAsync: {0} ({1})",
                index, DateTime.Now - start);
        }

        static Task ShowIndex(int index, DateTime start)
        {
            return Task.Factory.StartNew(() => {
                Console.WriteLine("ShowIndex: {0} ({1})",
                    index, DateTime.Now - start);
                Task.Delay(index * 100).Wait();
                Console.WriteLine("!ShowIndex: {0} ({1})",
                    index, DateTime.Now - start);
            });
        }
    }
}

So this calls ShowIndexAsync 10 times, then ShowIndex 10 times, and waits for them all to finish. ShowIndexAsync is "async to the core" while ShowIndex is not, but they both operate on tasks. The blocking operation here is Task.Delay; the difference is that one awaits that task, while the other calls .Wait() on it inside of a task.

You'd expect the first ones to be queued (ShowIndexAsync) to finish first, but you'd be incorrect.

ShowIndexAsync: 0 (00:00:00.0060000)
!ShowIndexAsync: 0 (00:00:00.0070000)
ShowIndexAsync: 1 (00:00:00.0080000)
ShowIndexAsync: 2 (00:00:00.0110000)
ShowIndexAsync: 3 (00:00:00.0110000)
ShowIndexAsync: 4 (00:00:00.0120000)
ShowIndexAsync: 5 (00:00:00.0130000)
ShowIndexAsync: 6 (00:00:00.0130000)
ShowIndexAsync: 7 (00:00:00.0140000)
ShowIndexAsync: 8 (00:00:00.0150000)
ShowIndexAsync: 9 (00:00:00.0150000)
ShowIndex: 0 (00:00:00.0020000)
!ShowIndex: 0 (00:00:00.0020000)
ShowIndex: 1 (00:00:00.0030000)
!ShowIndex: 1 (00:00:00.1100000)
ShowIndex: 2 (00:00:00.1100000)
!ShowIndex: 2 (00:00:00.3200000)
ShowIndex: 3 (00:00:00.3200000)
!ShowIndex: 3 (00:00:00.6220000)
ShowIndex: 4 (00:00:00.6220000)
!ShowIndex: 4 (00:00:01.0280000)
ShowIndex: 5 (00:00:01.0280000)
!ShowIndex: 5 (00:00:01.5420000)
ShowIndex: 6 (00:00:01.5420000)
!ShowIndex: 6 (00:00:02.1500000)
ShowIndex: 7 (00:00:02.1510000)
!ShowIndex: 7 (00:00:02.8650000)
ShowIndex: 8 (00:00:02.8650000)
!ShowIndex: 8 (00:00:03.6660000)
ShowIndex: 9 (00:00:03.6660000)
!ShowIndex: 9 (00:00:04.5780000)
!ShowIndexAsync: 1 (00:00:04.5950000)
!ShowIndexAsync: 2 (00:00:04.5960000)
!ShowIndexAsync: 3 (00:00:04.5970000)
!ShowIndexAsync: 4 (00:00:04.5970000)
!ShowIndexAsync: 5 (00:00:04.5980000)
!ShowIndexAsync: 6 (00:00:04.5990000)
!ShowIndexAsync: 7 (00:00:04.5990000)
!ShowIndexAsync: 8 (00:00:04.6000000)
!ShowIndexAsync: 9 (00:00:04.6010000)

Why did that happen?

The task scheduler is only going to use so many real threads. "await" compiles to a cooperative multitasking state machine. A blocking operation that is not awaited (Task.Delay(...).Wait() in this example, the Regex matching in my question) does not cooperate, so the task scheduler cannot manage the tasks properly.
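The difference between awaiting and blocking can be isolated in a few lines. This is a minimal sketch, not the program above: DelayDemo and its two methods are hypothetical names, and the blocking variant deliberately runs every wait on the calling thread to mimic a scheduler that has only one thread available.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

static class DelayDemo
{
    // Cooperative: all delays are started up front and awaited together,
    // so no thread is held while the timers run.
    public static async Task<TimeSpan> AwaitedAsync(int count, int stepMs)
    {
        var sw = Stopwatch.StartNew();
        await Task.WhenAll(Enumerable.Range(0, count).Select(i => Task.Delay(i * stepMs)));
        return sw.Elapsed;
    }

    // Blocking: each Wait() holds the calling thread until its delay ends,
    // so the delays run one after another.
    public static TimeSpan Blocking(int count, int stepMs)
    {
        var sw = Stopwatch.StartNew();
        foreach (var i in Enumerable.Range(0, count))
            Task.Delay(i * stepMs).Wait();
        return sw.Elapsed;
    }
}
```

With count = 10 and stepMs = 100, the awaited version should take roughly as long as the longest delay (about 0.9 s), while the blocking version should take roughly the sum of all delays (about 4.5 s).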

If we change our sample program to:

    static async void MainAsync(string[] args)
    {
        var asyncStart = DateTime.Now;
        await Task.WhenAll(Enumerable.Range(0, 10).Select(i =>
            ShowIndexAsync(i, asyncStart)));

        var start = DateTime.Now;
        await Task.WhenAll(Enumerable.Range(0, 10).Select(i =>
            ShowIndex(i, start)));

        Console.ReadLine();
    }

Then our output changes to:

ShowIndexAsync: 0 (00:00:00.0050000)
!ShowIndexAsync: 0 (00:00:00.0050000)
ShowIndexAsync: 1 (00:00:00.0060000)
ShowIndexAsync: 2 (00:00:00.0080000)
ShowIndexAsync: 3 (00:00:00.0090000)
ShowIndexAsync: 4 (00:00:00.0090000)
ShowIndexAsync: 5 (00:00:00.0100000)
ShowIndexAsync: 6 (00:00:00.0110000)
ShowIndexAsync: 7 (00:00:00.0110000)
ShowIndexAsync: 8 (00:00:00.0120000)
ShowIndexAsync: 9 (00:00:00.0120000)
!ShowIndexAsync: 1 (00:00:00.1150000)
!ShowIndexAsync: 2 (00:00:00.2180000)
!ShowIndexAsync: 3 (00:00:00.3160000)
!ShowIndexAsync: 4 (00:00:00.4140000)
!ShowIndexAsync: 5 (00:00:00.5190000)
!ShowIndexAsync: 6 (00:00:00.6130000)
!ShowIndexAsync: 7 (00:00:00.7190000)
!ShowIndexAsync: 8 (00:00:00.8170000)
!ShowIndexAsync: 9 (00:00:00.9170000)
ShowIndex: 0 (00:00:00.0030000)
!ShowIndex: 0 (00:00:00.0040000)
ShowIndex: 3 (00:00:00.0060000)
ShowIndex: 4 (00:00:00.0090000)
ShowIndex: 2 (00:00:00.0100000)
ShowIndex: 1 (00:00:00.0100000)
ShowIndex: 5 (00:00:00.0130000)
ShowIndex: 6 (00:00:00.0130000)
ShowIndex: 7 (00:00:00.0150000)
ShowIndex: 8 (00:00:00.0180000)
!ShowIndex: 7 (00:00:00.7660000)
!ShowIndex: 6 (00:00:00.7660000)
ShowIndex: 9 (00:00:00.7660000)
!ShowIndex: 2 (00:00:00.7660000)
!ShowIndex: 5 (00:00:00.7660000)
!ShowIndex: 4 (00:00:00.7660000)
!ShowIndex: 3 (00:00:00.7660000)
!ShowIndex: 1 (00:00:00.7660000)
!ShowIndex: 8 (00:00:00.8210000)
!ShowIndex: 9 (00:00:01.6700000)

Notice how the async calls have a nice, even distribution of end times, but the non-async code does not. The task scheduler is getting blocked because it won't create additional real threads; it's expecting cooperation.

I don't expect it to take less CPU time or the like, but my goal is to make FindThings multitask in a cooperative manner, i.e., make it "async to the core."

Joseph Lennox asked Sep 16 '14 19:09


1 Answer

Regex searches are a CPU-bound operation, so they're going to take time. You can use Task.Run to push the work off to a background thread and thus keep your UI responsive, but it won't help them go any faster.
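As a rough illustration of the Task.Run suggestion, the sketch below wraps the synchronous search in a task. RegexOffload and MatchesAsync are hypothetical names, not part of the .NET API; the only library calls used are Task.Run and Regex.Matches. Because Regex.Matches populates its MatchCollection lazily, the sketch reads Count inside the delegate so that the search itself runs on the thread-pool thread.

```csharp
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class RegexOffload
{
    // Hypothetical helper: runs the CPU-bound search on a thread-pool thread
    // so the caller can await it instead of blocking.
    public static Task<MatchCollection> MatchesAsync(Regex regex, string input)
    {
        return Task.Run(() =>
        {
            var matches = regex.Matches(input);
            // Reading Count forces the lazy collection to be fully populated
            // here rather than later on the caller's thread.
            var forced = matches.Count;
            return matches;
        });
    }
}
```

In the question's FindThings, the call would then read: var exitMatches = await RegexOffload.MatchesAsync(_exitExp, fileContent); This frees the calling thread during the search but does not reduce the total CPU time.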

Since your searches are already in parallel, there's probably nothing more you can do. You could try using asynchronous file reads to reduce the number of blocked threads in your thread pool, but it probably won't have a huge effect.

Your current code calls ReadToEndAsync, but the file must also be opened for asynchronous access: use the FileStream constructor and explicitly request an asynchronous file handle, either by passing true for the useAsync parameter or by passing FileOptions.Asynchronous for the options parameter.
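A minimal sketch of opening the file with an asynchronous handle follows. AsyncFileReader and ReadAllTextAsync are hypothetical names chosen for illustration, and the 4096-byte buffer size is an arbitrary assumption; the boolean argument is the FileStream constructor's useAsync parameter.

```csharp
using System.IO;
using System.Threading.Tasks;

static class AsyncFileReader
{
    // Hypothetical helper: opens the file with an asynchronous handle
    // before reading it to the end.
    public static async Task<string> ReadAllTextAsync(string path)
    {
        using (var stream = new FileStream(
            path, FileMode.Open, FileAccess.Read, FileShare.Read,
            bufferSize: 4096, useAsync: true))
        using (var reader = new StreamReader(stream))
        {
            return await reader.ReadToEndAsync();
        }
    }
}
```

In FindThings this would replace the File.OpenText block: fileContent = await AsyncFileReader.ReadAllTextAsync(path);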

Stephen Cleary answered Oct 05 '22 11:10