Say I have a toy language that has the following string:
fun( fun3\(\) ) + fun4()
Here, 'fun' receives 'fun3()' as its argument. And fun4() is left for later evaluation.
Now say I have a different string:
fun( fun3()\\) )
Here, 'fun' should receive 'fun3()\' and we have a ) leftover.
Escaping a '\' by writing '\\' gives a literal backslash, so that pair of '\'s no longer escapes the bracket. A third '\' would escape the bracket again, and so on.
Now, say I want to match this string using C#'s more capable Regex library, specifically using the way it matches brackets (balancing groups). I am aware that normally I'd use a proper parser rather than (extended) regular expressions. This is less about which tool I should be using, and more about what this tool can do.
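To make the escaping rule concrete, here is a minimal sketch of it outside of any regex: a character counts as escaped when an odd number of consecutive backslashes sits directly in front of it. The class name `EscapeDemo` and the helper `IsEscaped` are made up for illustration and are not part of the toy language or of .NET.

```csharp
using System;

static class EscapeDemo
{
    // Hypothetical helper: true when the character at 'index' is preceded
    // by an odd number of consecutive backslashes, i.e. it is escaped.
    public static bool IsEscaped(string text, int index)
    {
        int backslashes = 0;
        for (int i = index - 1; i >= 0 && text[i] == '\\'; i--)
            backslashes++;
        return backslashes % 2 == 1;
    }

    static void Main()
    {
        Console.WriteLine(IsEscaped(@"a\)", 2));    // True:  one backslash escapes the bracket
        Console.WriteLine(IsEscaped(@"a\\)", 3));   // False: the pair is a literal backslash
        Console.WriteLine(IsEscaped(@"a\\\)", 4));  // True:  the third backslash escapes again
    }
}
```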
I'll use the following three strings as my tests.
fun(abc) fun3()
Which will mean fun() receives 'abc' as its argument. fun3() is leftover.
fun(\\\)\)) fun3()
Which will mean fun() receives '\))' as its argument. fun3() is leftover.
fun(fun2(\)\\\() ) fun3()
Which will mean fun() receives 'fun2()\()' as its argument. fun3() is leftover.
As Alan Moore presumed in this StackOverflow question, the first thing I want to use is lookbehind. The regex below handles the first two test cases, but not the third: it takes the first unescaped ')' it sees and knows nothing about nesting.
Regex catchRegex = new Regex(@"^fun\((.*?(?<!\\)(?:\\\\)*)(?<ClosingChar>[\)])(.*$)");
string testcase0 = @"fun(abc) fun3()";
string testcase1 = @"fun(\\\)\)) fun3()";
string testcase2 = @"fun(fun2(\)\\\() ) fun3()";
Console.WriteLine(catchRegex.Match(testcase0).Groups[1]); // 'abc'
Console.WriteLine(catchRegex.Match(testcase0).Groups[2]); // ' fun3()'
Console.WriteLine(catchRegex.Match(testcase0).Groups[3]); // ')'
Console.WriteLine(catchRegex.Match(testcase1).Groups[1]); // '\\\)\)'
Console.WriteLine(catchRegex.Match(testcase1).Groups[2]); // ' fun3()'
Console.WriteLine(catchRegex.Match(testcase1).Groups[3]); // ')'
Console.WriteLine(catchRegex.Match(testcase2).Groups[1]); // 'fun2(\)\\\(' <--!
Console.WriteLine(catchRegex.Match(testcase2).Groups[2]); // ' ) fun3()' <--!
Console.WriteLine(catchRegex.Match(testcase2).Groups[3]); // ')'
So now we're down to what .NET can do: bracket matching. It passes the first test, but because I don't tell it to ignore escaped brackets, it fails the others. This is only fair.
Regex bracketRegex = new Regex(@"^fun\(([^\)]*|(?<BR>)\(|(?<-BR>)\))(?<ClosingChar>[\)])(.*$)");
Console.WriteLine(bracketRegex.Match(testcase0).Groups[1]); // 'abc'
Console.WriteLine(bracketRegex.Match(testcase0).Groups[2]); // ' fun3()'
Console.WriteLine(bracketRegex.Match(testcase0).Groups[3]); // ''
Console.WriteLine(bracketRegex.Match(testcase1).Groups[1]); // '\\\'
Console.WriteLine(bracketRegex.Match(testcase1).Groups[2]); // '\)) fun3()'
Console.WriteLine(bracketRegex.Match(testcase1).Groups[3]); // ''
Console.WriteLine(bracketRegex.Match(testcase2).Groups[1]); // 'fun2(\' <--!
Console.WriteLine(bracketRegex.Match(testcase2).Groups[2]); // '\\\() ) fun3()' <--!
Console.WriteLine(bracketRegex.Match(testcase2).Groups[3]); // ''
But the problem is the next step. Combining version 1 and version 2 doesn't get me anywhere. So my question to you, StackOverflow: is there a way to do this?
Regex bracketAwareRegex = new Regex(@"^fun\(([^\)]*|(?<BR>)(?<!\\)(?:\\\\)*\(|(?<-BR>)(?<!\\)(?:\\\\)*\))(?<ClosingChar>[\)])(.*$)");
Console.WriteLine(bracketAwareRegex.Match(testcase0).Groups[1]); // 'abc'
Console.WriteLine(bracketAwareRegex.Match(testcase0).Groups[2]); // ' fun3()'
Console.WriteLine(bracketAwareRegex.Match(testcase0).Groups[3]); // ''
Console.WriteLine(bracketAwareRegex.Match(testcase1).Groups[1]); // '\\\'
Console.WriteLine(bracketAwareRegex.Match(testcase1).Groups[2]); // '\)) fun3()'
Console.WriteLine(bracketAwareRegex.Match(testcase1).Groups[3]); // ''
Console.WriteLine(bracketAwareRegex.Match(testcase2).Groups[1]); // 'fun2(\' <--!
Console.WriteLine(bracketAwareRegex.Match(testcase2).Groups[2]); // '\\\() ) fun3()' <--!
Console.WriteLine(bracketAwareRegex.Match(testcase2).Groups[3]); // ''
Because that didn't work.
I propose this regex:
@"^fun\(((?:[^()\\]|\\.|(?<o>\()|(?<-o>\)))+(?(o)(?!)))\)(.*$)"
ideone demo
I removed the ClosingChar capture.
Results:
// The proposed pattern from above; Groups[1] is the argument, Groups[2] the leftover.
Regex balancedRegex = new Regex(@"^fun\(((?:[^()\\]|\\.|(?<o>\()|(?<-o>\)))+(?(o)(?!)))\)(.*$)");
string testcase0 = @"fun(abc) fun3()";
Console.WriteLine(balancedRegex.Match(testcase0).Groups[1]); // 'abc'
Console.WriteLine(balancedRegex.Match(testcase0).Groups[2]); // ' fun3()'
string testcase1 = @"fun(\\\)\)) fun3()";
Console.WriteLine(balancedRegex.Match(testcase1).Groups[1]); // '\\\)\)'
Console.WriteLine(balancedRegex.Match(testcase1).Groups[2]); // ' fun3()'
string testcase2 = @"fun(fun2(\)\\\() ) fun3()";
Console.WriteLine(balancedRegex.Match(testcase2).Groups[1]); // 'fun2(\)\\\() ' (note the trailing space)
Console.WriteLine(balancedRegex.Match(testcase2).Groups[2]); // ' fun3()'
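Note that the captured group still contains the escape characters. If you want the argument as 'fun' would receive it, one possible follow-up step is to strip one level of escaping afterwards. This is only a sketch: the class `UnescapeDemo` and the helper `Unescape` are hypothetical names, and it assumes a backslash always escapes exactly the one character after it (a lone trailing backslash is left untouched).

```csharp
using System;
using System.Text.RegularExpressions;

static class UnescapeDemo
{
    // Hypothetical helper: replace each backslash-plus-character pair
    // with the character alone, removing one level of escaping.
    public static string Unescape(string raw)
    {
        return Regex.Replace(raw, @"\\(.)", "$1");
    }

    static void Main()
    {
        Console.WriteLine(Unescape(@"abc"));            // abc
        Console.WriteLine(Unescape(@"\\\)\)"));         // \))
        Console.WriteLine(Unescape(@"fun2(\)\\\() "));  // fun2()\() followed by a space
    }
}
```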
I have another way of dealing with escaped characters, which is using something a bit like:
(?:[^()\\]|\\.)
Combined with the balancing groups, this gives the regex above:
^fun\(            Match 'fun(' literally at the beginning
(                 Start capture group 1 (the argument)
  (?:
    [^()\\]       Match anything that is not '(', ')' or '\'
  |
    \\.           Match any escaped character
  |
    (?<o>\()      Match a '(' and push it onto the capture stack 'o'
  |
    (?<-o>\))     Match a ')' and pop the most recent 'o' capture
  )+
  (?(o)(?!))      Fail if 'o' still holds a capture (an unmatched '(')
)                 End capture group 1
\)                Match the ')' that closes fun(
(.*$)             Capture everything left over in group 2
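The breakdown can also live in the code itself by using RegexOptions.IgnorePatternWhitespace, which lets the pattern span several lines with # comments. The following is a sketch of that idea rather than a drop-in replacement: the class `BalancedCall`, the method `TryParse` and the group names `arg` and `rest` are my own illustrative choices (the one-line regex above uses numbered groups instead).

```csharp
using System;
using System.Text.RegularExpressions;

static class BalancedCall
{
    // Same pattern as above, spread out and commented. 'arg' and 'rest'
    // are illustrative group names standing in for groups 1 and 2.
    static readonly Regex Pattern = new Regex(@"
        ^fun\(                # literal 'fun' and its opening bracket
        (?<arg>
            (?:
                [^()\\]       # any character that is not a bracket or backslash
              | \\.           # a backslash plus the character it escapes
              | (?<o>\()      # unescaped opening bracket: push onto stack 'o'
              | (?<-o>\))     # unescaped closing bracket: pop from stack 'o'
            )+
            (?(o)(?!))        # fail if an opening bracket is still unmatched
        )
        \)                    # the bracket that closes the call
        (?<rest>.*)$          # everything left over
        ", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical wrapper: returns false when the input is not a balanced call.
    public static bool TryParse(string input, out string arg, out string rest)
    {
        Match m = Pattern.Match(input);
        arg = m.Success ? m.Groups["arg"].Value : string.Empty;
        rest = m.Success ? m.Groups["rest"].Value : string.Empty;
        return m.Success;
    }

    static void Main()
    {
        string[] tests =
        {
            @"fun(abc) fun3()",
            @"fun(\\\)\)) fun3()",
            @"fun(fun2(\)\\\() ) fun3()",
        };
        foreach (string test in tests)
        {
            if (TryParse(test, out string arg, out string rest))
                Console.WriteLine($"arg='{arg}' rest='{rest}'");
        }
        // arg='abc' rest=' fun3()'
        // arg='\\\)\)' rest=' fun3()'
        // arg='fun2(\)\\\() ' rest=' fun3()'
    }
}
```

An input with an unclosed bracket, such as `fun(a(b) c`, makes TryParse return false, because neither the conditional nor the final `\)` can be satisfied.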