Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex match take a very long time to execute

I wrote a regular expression that parses a file path into different group (DRIVE, DIR, FILE, EXTENSION).

^((?<DRIVE>[a-zA-Z]):\\)*((?<DIR>[a-zA-Z0-9_]+(([a-zA-Z0-9_\s_\-\.]*[a-zA-Z0-9_]+)|([a-zA-Z0-9_]+)))\\)*(?<FILE>([a-zA-Z0-9_]+(([a-zA-Z0-9_\s_\-\.]*[a-zA-Z0-9_]+)|([a-zA-Z0-9_]+))\.(?<EXTENSION>[a-zA-Z0-9]{1,6})$))

I made a test in C#. When the path I want to test is correct. The result is very quick and this is what I wanted to expect.

string path = @"C:\Documents and Settings\jhr\My Documents\Visual Studio 2010\Projects\FileEncryptor\Dds.FileEncryptor\Dds.FileEncryptor.csproj";

=> OK

But when I try to test with a path that I know that will not match, like this :

string path = @"C:\Documents and Settings\jhr\My Documents\Visual Studio 2010\Projects\FileEncryptor\Dds.FileEncryptor\Dds.FileEncryptor?!??????";

=> BUG

The test freezes when I call this part of code

Match match = s_fileRegex.Match(path);

When i look into my Process Explorer, I see the process QTAgent32.exe hanging at 100% of my processor. What does it mean ?

like image 466
RedPaladin Avatar asked Aug 04 '10 07:08

RedPaladin


2 Answers

The problem you are experiencing is called catastrophic backtracking and is due to the large number of ways that you regular expression can match the start of the string, which gives slow performance due to the backtracking regular expression engine in .NET.

I think you are using * too frequently in your regular expression. * does not mean "concatenate" - it means "0 or more times". For example there should not be a * here:

((?<DRIVE>[a-zA-Z]):\\)*

There should be at most one drive specification. You should use ? instead here, or else no quantifier at all if you want the drive specification to be compulsory. Similarly there appear to be other places in your regular expression where the quantifier is incorrect.

like image 109
Mark Byers Avatar answered Oct 03 '22 17:10

Mark Byers


Mark Byers is correct in that the reason for the problem is catastrophic backtracking, however it's the last part that's causing the problem, not the bit that matches the drive letter.

For example, in

(?<FILE>
  ([a-zA-Z0-9_]+
    (
      ([a-zA-Z0-9_\s_\-\.]*[a-zA-Z0-9_]+)
    |
      ([a-zA-Z0-9_]+)
    )\.
    (?<EXTENSION>[a-zA-Z0-9]{1,6})
  $)
)

you can see that

([a-zA-Z0-9_\s_\-\.]*[a-zA-Z0-9_]+)
|
([a-zA-Z0-9_]+)

can match the same string in a number of different ways that will increase exponentially with the length of the filename.

When it happens that the extension part of the regex fails to match, the regex engine backtracks and tries a different permutation for the filename part, hoping that this enables the extension part to match - which of course it never will, but the regex engine can't figure that out. RegexBuddy, when asked to test the regex on the path you provided, aborts the match attempt after 1.000.000 iterations. The C# regex engine will plod on until it has exhausted all permutations, pinning your CPU at 100% during that time.

To fix this, it's generally necessary to avoid repetitions of repeated elements, to avoid alternations that match the same things, and possibly to enclose parts of the match in atomic groups that will not be backtracked into if a later part of the regex fails.

In your case however, it's better to use the right tools for the job, and those are the path manipulation functions of .NET.

like image 32
Tim Pietzcker Avatar answered Oct 03 '22 17:10

Tim Pietzcker