
EOL Special Char not matching

Tags:

c#

regex

I am trying to find every "a -> b, c, d" pattern in an input string. The pattern I am using is:

"^[ \t]*(\\w+)[ \t]*->[ \t]*(\\w+)((?:,[ \t]*\\w+)*)$"

This is a C# string literal. The "\t" is a tab (a single escaped literal, interpreted by the C# compiler before the regex engine sees it). The "\w" is the well-known predefined regex word class; it is double-escaped so that the string contains a literal "\w", which the .NET Regex API then interprets as the word class.

The input is:

a -> b
b -> c
c -> d

The function is:

private void ParseAndBuildGraph(String input) {
    MatchCollection mc = Regex.Matches(input, "^[ \t]*(\\w+)[ \t]*->[ \t]*(\\w+)((?:,[ \t]*\\w+)*)$", RegexOptions.Multiline);
    foreach (Match m in mc) {
        Debug.WriteLine(m.Value);
    }
}

The output is:

c -> d

The problem seems to be with the end-of-line anchor "$". If I insert a "\r" before "$", it works, but I thought "$" would match before any line terminator (with the Multiline option), especially a \r\n in a Windows environment. Is that not the case?

Aurelien Ribon asked Jan 22 '23 10:01



1 Answer

This surprised me, too. In .NET regexes, $ doesn't match before every line separator; it matches only before a linefeed, the character \n (or at the very end of the string). This behavior is consistent with Perl's regex flavor, but it's still wrong, in my opinion. According to the Unicode standard, $ should match before any of:

\n, \r\n, \r, \x85, \u2028, \u2029, \v or \f

...and never match between \r and \n. Java complies with that (except \v and \f), but .NET, which came out long after Java, and whose Unicode support is at least as good as Java's, only recognizes \n. You'd think they would at least handle \r\n correctly, given how strongly Microsoft is associated with that line separator.
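The difference is easy to reproduce. The following is a minimal sketch, not part of the original question: the class name DollarAnchorDemo and the helper CountMatches are made up for illustration, and the pattern is the one from the question rewritten as a verbatim string. It runs the same pattern against the same three lines, once with \n line endings and once with \r\n.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only Regex.Matches and RegexOptions.Multiline are real .NET API.
public static class DollarAnchorDemo
{
    // The question's pattern as a verbatim string, so backslashes reach the regex engine as-is.
    const string Pattern = @"^[ \t]*(\w+)[ \t]*->[ \t]*(\w+)((?:,[ \t]*\w+)*)$";

    // Counts how many lines of the input the pattern matches in Multiline mode.
    public static int CountMatches(string input) =>
        Regex.Matches(input, Pattern, RegexOptions.Multiline).Count;

    static void Main()
    {
        string unixInput = "a -> b\nb -> c\nc -> d";
        string windowsInput = "a -> b\r\nb -> c\r\nc -> d";

        Console.WriteLine(CountMatches(unixInput));    // 3
        Console.WriteLine(CountMatches(windowsInput)); // 1 (only the last line, which has no \r)
    }
}
```

With \r\n input, only the last line matches because it ends at the end of the string rather than at a \r, which is the output reported in the question.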

Be aware that . follows the same pattern: it doesn't match \n (unless Singleline mode is set), but it does match \r. If you had used .+ instead of \w+ in your regex, you might not have noticed this problem; the carriage-return would have been included in the match, but the console would have ignored it when you printed the results.
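To see the carriage return sneaking into a match, here is a small sketch (the class name DotCarriageReturnDemo and the helper FirstLine are hypothetical names chosen for this example) that matches a whole line with .+ and inspects the result instead of just printing it.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class DotCarriageReturnDemo
{
    // Returns the first line as matched by ^.+$ in Multiline mode.
    public static string FirstLine(string input) =>
        Regex.Match(input, "^.+$", RegexOptions.Multiline).Value;

    static void Main()
    {
        string line = FirstLine("a -> b\r\nb -> c");

        // "a -> b" is 6 characters; the match also contains the trailing \r.
        Console.WriteLine(line.Length);                     // 7
        Console.WriteLine(line[line.Length - 1] == '\r');   // True
    }
}
```

Checking the length or the last character makes the stray \r visible, whereas writing the string to a console usually hides it.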

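EDIT: If you want to allow for the carriage return without including it in your results, you can replace the anchor with a lookahead: (?=\r?\n). Note that this requires a linefeed after every line; to also match a final line that has no trailing newline, use (?=\r?$) instead.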
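As a sketch of the lookahead approach applied to the question's pattern: the version below uses (?=\r?$) rather than a literal \n inside the lookahead, which is my own variation so that a last line without a trailing newline still matches. The class name LookaheadFixDemo and the helper FindEdges are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class LookaheadFixDemo
{
    // The question's pattern with the trailing $ replaced by the lookahead (?=\r?$).
    // The optional \r is checked but not consumed, so it stays out of the match.
    const string Pattern = @"^[ \t]*(\w+)[ \t]*->[ \t]*(\w+)((?:,[ \t]*\w+)*)(?=\r?$)";

    // Returns the text of every matching line, without any trailing \r.
    public static List<string> FindEdges(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern, RegexOptions.Multiline))
            results.Add(m.Value);
        return results;
    }

    static void Main()
    {
        // Prints "a -> b", "b -> c, d" and "c -> d", one per line.
        foreach (string edge in FindEdges("a -> b\r\nb -> c, d\r\nc -> d"))
            Console.WriteLine(edge);
    }
}
```

If having the \r inside the overall match is acceptable, the simpler \r?$ at the end of the pattern also works, since the capture groups still exclude it.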

Alan Moore answered Jan 25 '23 00:01
