Why does .NET's regex engine behave so bizarrely when I omit the "else" from a conditional group?

Code:

Match match = Regex.Match("abc", "(?(x)bx)");
Console.WriteLine("Success: {0}", match.Success);
Console.WriteLine("Value: \"{0}\"", match.Value);
Console.WriteLine("Index: {0}", match.Index);

Output:

Success: True
Value: ""
Index: 1

It seems that a conditional group without an "else" expression instead builds a lookahead from the first character of the "if" expression and uses that as the "else". In this case it runs as if the regex were (?(x)bx|(?=b)).
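One way to check that reading is to run both patterns side by side. The sketch below is only a test harness: the class name ConditionalProbe and the helper Describe are made up for illustration. The result for the pattern without an "else" may depend on the runtime version, so no output is claimed for it here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalProbe
{
    // Hypothetical helper: formats the first match of `pattern` in `input`.
    public static string Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return string.Format("Success: {0}, Value: \"{1}\", Index: {2}",
                             m.Success, m.Value, m.Index);
    }

    public static void Main()
    {
        // Conditional with the "else" omitted (the pattern from the question).
        Console.WriteLine(Describe("abc", "(?(x)bx)"));

        // The same conditional with the suspected implicit "else" written out.
        Console.WriteLine(Describe("abc", "(?(x)bx|(?=b))"));
    }
}
```

If the two lines print the same thing, that is consistent with the guess above, though it does not prove the engine rewrites the pattern this way.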

What the **** is going on here? Is this intentional? It doesn't seem to be documented.

Edit: An issue has been created in the corefx repository: https://github.com/dotnet/corefx/issues/26787

Kendall Frey asked Feb 02 '18 01:02


1 Answer

I think it may be a mis-optimization. As the documentation page "Alternation Constructs in Regular Expressions" points out:

Because the regular expression engine interprets expression as an anchor (a zero-width assertion), expression must either be a zero-width assertion (for more information, see Anchors) or a subexpression that is also contained in yes.

Your expression, x, satisfies neither of these constraints. I suspect some form of optimization is at work: since the expression isn't zero-width, the input is advanced until the yes part can potentially be satisfied (that being the only pattern you've given the regex engine to work with).

As pointed out in the comments, since your expression isn't also contained in yes, the pattern can never match and so it's unlikely too much concern would be raised about the mis-optimization.
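For contrast, here is a minimal sketch of a conditional that stays inside the documented constraints: the condition is an explicit zero-width lookahead and both branches are spelled out. The pattern, the class name ConformingConditional, and the helper FirstMatch are illustrative assumptions, not anything taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConformingConditional
{
    // Condition: explicit lookahead (?=a), which is a zero-width assertion.
    // Yes branch: ab. No branch: c.
    public const string Pattern = "(?(?=a)ab|c)";

    // Hypothetical helper: returns "index:value" for the first match,
    // or "no match" if there is none.
    public static string FirstMatch(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Index + ":" + m.Value : "no match";
    }

    public static void Main()
    {
        Console.WriteLine(FirstMatch("abc")); // 0:ab  (lookahead succeeds, yes branch matches)
        Console.WriteLine(FirstMatch("xc"));  // 1:c   (lookahead fails, no branch matches at index 1)
        Console.WriteLine(FirstMatch("xyz")); // no match
    }
}
```

Writing the condition as an explicit lookahead also avoids any ambiguity with a group name.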

Damien_The_Unbeliever answered Oct 17 '22 23:10