Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is this regex hanging with two trailing back-slashes but works fine with just one?

Tags:

c#

.net

regex

I have the following code using regex that hangs at the last line when I run it:

const char PathSeparator = '\\';
const string PathPrefix = "\\\\";

var reg = new Regex(string.Format("^{0}(([^{1}\r\n\v\f\u2028\u2029]+){1}?)+$", 
    Regex.Escape(PathPrefix), Regex.Escape(PathSeparator.ToString())));

var test = @"\\Inbox\Test123\Intermediate\3322FB69-FE3F-407E-9E15-584382A407A2\\";
var matches = reg.Matches(test);
var group = matches[0].Groups[2];

It works fine if I remove one of the final backslashes in test, e.g.

var test = @"\\Inbox\Test123\Intermediate\3322FB69-FE3F-407E-9E15-584382A407A2\";

Can someone please help me figure out why it's hanging? I know I can set a timeout; that's not my question.

like image 839
rory.ap Avatar asked Oct 05 '15 20:10

rory.ap


1 Answers

Now you have two problems

I have to begin this answer with the obligatory mention of the well-known quote. While regex is a powerful tool that can solve lots of problems (including this one) efficiently, it can also be a big source of trouble if you are not an expert on the subject and cause pain that would be entirely avoidable by going with lower-tech solutions.

In this instance you could trivially do the same job by splitting the input string on slashes, then validating each token produced on its own (yes, perhaps even with a regex). And it might actually be a smart move to do that because it would immensely reduce the complexity factor of the solution; something that will be quite valuable in the future when "a small tweak" (that increases complexity) is required.

That said, let's look at the interesting stuff: what happened?

Why does it run forever?

Because of excessive backtracking caused by the nested + quantifiers. This will be easier to illustrate using a more tame sample input of

\\Foo\Bar\Baz\\

Let's see what the regex engine would attempt to match when fed this input.

1) First attempt: the innermost group matches Foo\ and Bar\ and Baz\. Further repetition fails because of the trailing \, then the end-of-input marker fails to match so the attempt is rejected and the engine backtracks.

Note that if the trailing slash were not there, the regex engine would declare success and return at this point.

2) Because the slash in the innermost group is optional, the next attempt is Foo\, Bar\, Baz which also obviously fails. More backtracking.

3) The next candidate matches four repetitions of the innermost group: Foo\, Bar\, Ba and z\. Fail, more backtracking.

From now on I will simply list the attempted matches by separating repetitions of the innermost group with a single space:

Foo\ Bar\ Ba z      (i.e. 4 repetitions of length 4, 4, 2, 1, and fails)
Foo\ Bar\ Ba
Foo\ Bar\ B az\
Foo\ Bar\ B az
Foo\ Bar\ B a
Foo\ Bar\

And so on.

The thing to note here is that it took an unreasonable number of attempts to even rule out that the trailing "Baz" segment can be part of a successful match: we had to consider "Baz", "Ba" + "z", "Ba", "B" + "az", "B" + "a", and "B"; in general, for a segment of length N the number of possibilities is N! (factorial). The factorial function's growth is beyond intuitive human comprehension as Wikipedia illustrates.

Since your sample input contains a final segment of length 36 it's obvious that attempting to match this regex with incorrectly formatted input will never finish.

How to prevent this from happening?

In this case it's pretty simple. Since you know that with e.g. "Baz" there is no point in trying to split it into smaller chunks if matching it as a whole fails (because these chunks would need to be separated by a slash, which we already know they are not because slashes are not an allowed part of the attempted match), use an atomic capturing group:

var reg = new Regex(string.Format(
    "^{0}(?>([^{1}\r\n\v\f\u2028\u2029]+{1}?))+$",
    Regex.Escape(PathPrefix),
    Regex.Escape(PathSeparator.ToString())));

This will backtrack only 1 time instead of N! for each path segment on an unsuccessful match, reducing the time to fail to practically zero.

like image 191
Jon Avatar answered Nov 01 '22 19:11

Jon