.NET Core 2.0 Regex Timeout deadlocking

I have a .NET Core 2.0 application where I iterate over many files (600,000) of varying sizes (220GB total).

I enumerate them using

new DirectoryInfo(TargetPath)
    .EnumerateFiles("*.*", SearchOption.AllDirectories)
    .GetEnumerator()

and iterate over them using

Parallel.ForEach(contentList.GetConsumingEnumerable(),
    new ParallelOptions
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount * 2
    },
    file => ...

Inside of that, I scan each file against a list of regular expressions using

Parallel.ForEach(_Rules,
    new ParallelOptions
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount * 2
    },
    rule => ... 

Finally, I get the matches using an instance of the Regex class

RegEx = new Regex(
    Pattern.ToLowerInvariant(),
    RegexOptions.Multiline | RegexOptions.Compiled,
    TimeSpan.FromSeconds(_MaxSearchTime))

This instance is shared among all files so I compile it once. There are 175 patterns that are applied to the files.

At random (ish) spots, the application deadlocks and is completely unresponsive. No amount of try/catch stops this from happening. If I take the exact same code and compile it for .NET Framework 4.6 it works without any problems.

I've tried LOTS of things and my current test which seems to work (but I am very wary!) is to NOT use an INSTANCE, but instead to call the STATIC Regex.Matches method every time. I can't tell how much of a hit I am taking on performance, but at least I am not getting deadlocks.

I could use some insight. Failing that, let this at least serve as a cautionary tale.

Update: I get the file list like this:

private void GetFiles(string TargetPath, BlockingCollection<FileInfo> ContentCollector)
    {
        List<FileInfo> results = new List<FileInfo>();
        IEnumerator<FileInfo> fileEnum = null;
        FileInfo file = null;
        fileEnum = new DirectoryInfo(TargetPath).EnumerateFiles("*.*", SearchOption.AllDirectories).GetEnumerator();
        while (fileEnum.MoveNext())
        {
            try
            {
                file = fileEnum.Current;
                //Skip long file names to mimic .Net Framework deficiencies
                if (file.FullName.Length > 256) continue;
                ContentCollector.Add(file);
            }
            catch { }
        }
        ContentCollector.CompleteAdding();
    }

Inside my Rule class, here are the relevant methods:

_RegEx = new Regex(Pattern.ToLowerInvariant(), RegexOptions.Multiline | RegexOptions.Compiled, TimeSpan.FromSeconds(_MaxSearchTime));
...
    public MatchCollection Matches(string Input) { try { return _RegEx.Matches(Input); } catch { return null; } }
    public MatchCollection Matches2(string Input) { try { return Regex.Matches(Input, Pattern.ToLowerInvariant(), RegexOptions.Multiline, TimeSpan.FromSeconds(_MaxSearchTime)); } catch { return null; } }

Then, here is the matching code:

    public List<SearchResult> GetMatches(string TargetPath)
    {
        //Set up the concurrent containers
        ConcurrentBag<SearchResult> results = new ConcurrentBag<SearchResult>();
        BlockingCollection<FileInfo> contentList = new BlockingCollection<FileInfo>();

        //Start getting the file list
        Task collector = Task.Run(() => { GetFiles(TargetPath, contentList); });
        int cnt = 0;
        //Start processing the files.
        Task matcher = Task.Run(() =>
        {
            //Process each file making it as parallel as possible                
            Parallel.ForEach(contentList.GetConsumingEnumerable(), new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 2 }, file =>
            {
                //Read in the whole file and make it lowercase
                //This makes it so the Regex engine does not have
                //to do it for each 175 patterns!
                StreamReader stream = new StreamReader(File.OpenRead(file.FullName));
                string inputString = stream.ReadToEnd();
                stream.Close();
                string inputStringLC = inputString.ToLowerInvariant();

                //Run through all the patterns as parallel as possible
                Parallel.ForEach(_Rules, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 2 }, rule =>
                {
                    MatchCollection matches = null;
                    int matchCount = 0;
                    Stopwatch ruleTimer = Stopwatch.StartNew();

                    //Run the match for the rule and then get our count (does the actual match iteration)
                    try
                    {
                        //This does not work - Causes Deadlocks:
                        //matches = rule.Matches(inputStringLC);

                        //This works - No Deadlocks;
                        matches = rule.Matches2(inputStringLC);

                        //Process the regex by calling .Count()
                        if (matches == null) matchCount = 0;
                        else matchCount = matches.Count;
                    }

                    //Catch timeouts
                    catch (Exception ex)
                    {
                        //Log the error
                        string timeoutMessage = String.Format("****** Regex Timeout: {0} ===> {1} ===> {2}", ruleTimer.Elapsed, rule.Pattern, file.FullName);
                        Console.WriteLine(timeoutMessage);
                        matchCount = 0;
                    }
                    ruleTimer.Stop();

                    if (matchCount > 0)
                    {
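                        //cnt++ is not atomic when called from several Parallel.ForEach threads
                        System.Threading.Interlocked.Increment(ref cnt);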
                        //Iterate the matches and generate our match records
                        foreach (Match match in matches)
                        {
                            //Fill my result object
                            //...

                            //Add the result to the collection
                            results.Add(result);
                        }
                    }
                });
            });
        });

        //Wait until all are done.
        Task.WaitAll(collector, matcher);

        Console.WriteLine("Found {0:n0} files with {1:n0} matches", cnt, results.Count);


        return results.ToList();
    }

Update 2: The test I was running did not deadlock, but close to the end it seemed to stall, although I could still break into the process with VS. I then realized that my test did not set the timeout, whereas the code I posted does (rule.Matches and rule.Matches2). WITH the timeout, it deadlocks. WITHOUT the timeout, it does not. Both still work in .NET Framework 4.6. I need the timeout on the regex because some of the patterns stall out on some large files.

Update 3: I've been playing with timeout values and it seems to be some combination of threads running, exceptions from timeouts, and the timeout value that causes the Regex engine to deadlock. I can't pin it down exactly, but a timeout >= 5 minutes seems to help. As a temporary fix, I may set the value to 10 minutes, but this is not a permanent fix!

James Nix asked May 15 '18 18:05


1 Answer

If I had to guess, I would blame Regex:

  • RegexOptions.Compiled is not implemented in .NET Core 2.0 (source)
  • Some of your 175 patterns may be slightly evil

Together, these may cause a significant performance difference between .NET Framework 4.6 and .NET Core 2.0, which could leave the application unresponsive.
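
One way to check the second point is to run each pattern on its own, single-threaded, against a sample input and see which ones hit the timeout. The following is only a minimal sketch under that assumption: `PatternAudit` and `FindSlowPatterns` are hypothetical names, not part of your code or of .NET, and `RegexOptions.Compiled` is left out on purpose so that only the pattern itself is being measured.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class PatternAudit
{
    // Runs every pattern once against a single input, without any
    // parallelism, and returns the patterns that exceeded the timeout.
    public static List<string> FindSlowPatterns(
        IEnumerable<string> patterns, string input, TimeSpan timeout)
    {
        var slow = new List<string>();
        foreach (string pattern in patterns)
        {
            var regex = new Regex(pattern, RegexOptions.Multiline, timeout);
            Stopwatch timer = Stopwatch.StartNew();
            try
            {
                // Matches() is evaluated lazily. Reading Count forces the
                // whole scan, so a timeout surfaces here rather than at
                // the Matches() call itself.
                int count = regex.Matches(input).Count;
                Console.WriteLine("{0} ===> {1} matches in {2}",
                    pattern, count, timer.Elapsed);
            }
            catch (RegexMatchTimeoutException)
            {
                Console.WriteLine("{0} ===> timed out after {1}",
                    pattern, timer.Elapsed);
                slow.Add(pattern);
            }
        }
        return slow;
    }
}
```

Feeding it the lower-cased contents of one of the large files that stalls, together with the 175 patterns, should show whether a handful of patterns account for most of the time. Patterns that time out even in this single-threaded run are candidates for rewriting, regardless of what the runtime does with the timeout under load.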

Jakub Šturc answered Oct 24 '22 01:10