I have a .NET Core 2.0 application where I iterate over many files (600,000) of varying sizes (220GB total).
I enumerate them using
new DirectoryInfo(TargetPath)
.EnumerateFiles("*.*", SearchOption.AllDirectories)
.GetEnumerator()
and iterate over them using
Parallel.ForEach(contentList.GetConsumingEnumerable(),
new ParallelOptions
{
MaxDegreeOfParallelism = Environment.ProcessorCount * 2
},
file => ...
Inside of that, I have a list of regular expressions that I then scan the file with using
Parallel.ForEach(_Rules,
new ParallelOptions
{
MaxDegreeOfParallelism = Environment.ProcessorCount * 2
},
rule => ...
Finally, I get the matches using an instance of the Regex class
RegEx = new Regex(
Pattern.ToLowerInvariant(),
RegexOptions.Multiline | RegexOptions.Compiled,
TimeSpan.FromSeconds(_MaxSearchTime))
This instance is shared among all files so I compile it once. There are 175 patterns that are applied to the files.
At random (ish) spots, the application deadlocks and is completely unresponsive. No amount of try/catch stops this from happening. If I take the exact same code and compile it for .NET Framework 4.6 it works without any problems.
I've tried LOTS of things, and my current test, which seems to work (but I am very wary!), is to NOT use an INSTANCE, but instead to call the STATIC Regex.Matches method every time. I can't tell how much of a performance hit I am taking, but at least I am not getting deadlocks.
I could use some insight; failing that, this can at least serve as a cautionary tale.
Update: I get the file list like this:
private void GetFiles(string TargetPath, BlockingCollection<FileInfo> ContentCollector)
{
List<FileInfo> results = new List<FileInfo>();
IEnumerator<FileInfo> fileEnum = null;
FileInfo file = null;
fileEnum = new DirectoryInfo(TargetPath).EnumerateFiles("*.*", SearchOption.AllDirectories).GetEnumerator();
while (fileEnum.MoveNext())
{
try
{
file = fileEnum.Current;
//Skip long file names to mimic .Net Framework deficiencies
if (file.FullName.Length > 256) continue;
ContentCollector.Add(file);
}
catch { }
}
ContentCollector.CompleteAdding();
}
Inside my Rule class, here are the relevant methods:
_RegEx = new Regex(Pattern.ToLowerInvariant(), RegexOptions.Multiline | RegexOptions.Compiled, TimeSpan.FromSeconds(_MaxSearchTime));
...
public MatchCollection Matches(string Input) { try { return _RegEx.Matches(Input); } catch { return null; } }
public MatchCollection Matches2(string Input) { try { return Regex.Matches(Input, Pattern.ToLowerInvariant(), RegexOptions.Multiline, TimeSpan.FromSeconds(_MaxSearchTime)); } catch { return null; } }
Then, here is the matching code:
public List<SearchResult> GetMatches(string TargetPath)
{
//Set up the concurrent containers
ConcurrentBag<SearchResult> results = new ConcurrentBag<SearchResult>();
BlockingCollection<FileInfo> contentList = new BlockingCollection<FileInfo>();
//Start getting the file list
Task collector = Task.Run(() => { GetFiles(TargetPath, contentList); });
int cnt = 0;
//Start processing the files.
Task matcher = Task.Run(() =>
{
//Process each file making it as parallel as possible
Parallel.ForEach(contentList.GetConsumingEnumerable(), new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 2 }, file =>
{
//Read in the whole file and make it lowercase once,
//so the Regex engine does not have to do it
//for each of the 175 patterns!
StreamReader stream = new StreamReader(File.OpenRead(file.FullName));
string inputString = stream.ReadToEnd();
stream.Close();
string inputStringLC = inputString.ToLowerInvariant();
//Run through all the patterns as parallel as possible
Parallel.ForEach(_Rules, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 2 }, rule =>
{
MatchCollection matches = null;
int matchCount = 0;
Stopwatch ruleTimer = Stopwatch.StartNew();
//Run the match for the rule and then get our count (does the actual match iteration)
try
{
//This does not work - Causes Deadlocks:
//matches = rule.Matches(inputStringLC);
//This works - No Deadlocks;
matches = rule.Matches2(inputStringLC);
//Force the lazy MatchCollection to run the actual matching by reading .Count
if (matches == null) matchCount = 0;
else matchCount = matches.Count;
}
//Catch timeouts
catch (Exception ex)
{
//Log the error
string timeoutMessage = String.Format("****** Regex Timeout: {0} ===> {1} ===> {2}", ruleTimer.Elapsed, rule.Pattern, file.FullName);
Console.WriteLine(timeoutMessage);
matchCount = 0;
}
ruleTimer.Stop();
if (matchCount > 0)
{
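Interlocked.Increment(ref cnt); //Thread-safe; a plain cnt++ can lose updates inside Parallel.ForEach (needs using System.Threading)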
//Iterate the matches and generate our match records
foreach (Match match in matches)
{
//Fill my result object
//...
//Add the result to the collection
results.Add(result);
}
}
});
});
});
//Wait until all are done.
Task.WaitAll(collector, matcher);
Console.WriteLine("Found {0:n0} files with {1:n0} matches", cnt, results.Count);
return results.ToList();
}
Update 2
The test I was running did not deadlock, but when it got close to the end it seemed to stall, although I could still break into the process with VS. I then realized I didn't have the timeout set in my test, whereas I did in the code I posted (rule.Matches and rule.Matches2). WITH the timeout, it deadlocks. WITHOUT the timeout, it does not. Both still work in .NET Framework 4.6. I need the timeout on the regex because there are some large files that some of the patterns stall out on.
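For anyone trying to reproduce this, here is a minimal sketch of how the timeout surfaces. Regex.Matches returns a lazily evaluated MatchCollection, so a RegexMatchTimeoutException is normally thrown when Count is read (or the collection is enumerated), not by the Matches call itself. The class and method names below (RegexTimeoutSketch, CountMatches) and the -1 return value are hypothetical and only for illustration; they are not part of the code posted above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTimeoutSketch
{
    // Hypothetical helper: returns the number of matches,
    // or -1 if the regex's match timeout elapsed while matching.
    public static int CountMatches(Regex regex, string input)
    {
        try
        {
            // Matches() only sets up a lazy MatchCollection.
            // Reading Count runs the actual matching, so this is
            // where a timeout exception would be thrown.
            return regex.Matches(input).Count;
        }
        catch (RegexMatchTimeoutException)
        {
            // Catch the specific timeout exception rather than a bare catch,
            // so unrelated failures are not silently swallowed.
            return -1;
        }
    }
}
```

Usage would look something like `RegexTimeoutSketch.CountMatches(new Regex("a+", RegexOptions.Multiline, TimeSpan.FromSeconds(5)), text)`. Catching the specific exception type makes it easier to tell genuine timeouts apart from other errors while narrowing down the hang.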
Update 3: I've been playing with timeout values and it seems to be some combination of threads running, exceptions from timeouts, and the timeout value that causes the Regex engine to deadlock. I can't pin it down exactly, but a timeout >= 5 minutes seems to help. As a temporary fix, I may set the value to 10 minutes, but this is not a permanent fix!
If I had to guess, I would blame Regex. RegexOptions.Compiled is not implemented in .NET Core 2.0 (source). This may lead to a significant performance difference between .NET Framework 4.6 and .NET Core 2.0, which may result in an unresponsive application.