My problem is quite complex, but can be boiled down to a simple example.
I am writing a custom query language where users can input strings, which I parse into LINQ expressions.
What I would like to be able to do is split strings on the * character, unless it is correctly escaped.
Input         Output                   Query description
"*\\*"    --> { "*", "\\", "*" }    -- contains a '\'
"*\\\**"  --> { "*", "\\\*", "*" }  -- contains '\*'
"*\**"    --> { "*", "\*", "*" }    -- contains '*' (works now)
I don't mind Regex.Split returning empty strings, but I end up with this:
Regex.Split(@"*\\*", @"(?<!\\)(\*)") --> {"", "*", "\\*"}
As you can see, I have tried a negative lookbehind, which works for all my cases except this one. I have also tried Regex.Escape, but with no luck.
Obviously, my problem is that I am looking for \*, which \\* matches. But in this case, \\ is another escape sequence.
A solution does not necessarily have to involve a regex.
I think it's much easier to match than to split, especially since you are not removing anything from the initial string. So what to match? Everything except an unescaped *.
How to do that? With the regex below:
@"(?:[^*\\]+|\\.)+|\*"
(?:[^*\\]+|\\.)+ matches one or more runs of characters that are neither * nor \, or escaped characters (a backslash followed by any character). No lookaround is needed.
\* matches the separator.
In code:
using System;
using System.Text.RegularExpressions;
using System.Linq;

public class Test
{
    public static void Main()
    {
        string[] tests = new string[]{
            @"*\\*",
            @"*\\\**",
            @"*\**",
        };
        // First alternative: text and escaped characters; second: a bare * separator.
        Regex re = new Regex(@"(?:[^*\\]+|\\.)+|\*");
        foreach (string s in tests) {
            // Collect every match in order; nothing from the input is discarded.
            var parts = re.Matches(s)
                          .OfType<Match>()
                          .Select(m => m.Value)
                          .ToList();
            Console.WriteLine(string.Join(", ", parts.ToArray()));
        }
    }
}
Output:
*, \\, *
*, \\\*, *
*, \*, *
I came up with this regex: (?<=(?:^|[^\\])(?:\\\\)*)(\*)
You just whitelist the situations that can occur before an unescaped *, and these are:
the start of the string - ^
a character that is not \ - [^\\]
(a character that is not \, or the beginning of the string) and then an even number of \ - (^|[^\\])(\\\\)*
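using System;
using System.Linq;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        string[] tests = new string[]{
            @"*\\*",
            @"*\\\**",
            @"*\**",
            @"test\**test2",
        };
        // The capture group makes Regex.Split keep each * separator in the result.
        Regex re = new Regex(@"(?<=(?:^|[^\\])(?:\\\\)*)(\*)");
        foreach (string s in tests) {
            string[] m = re.Split(s);
            Console.WriteLine(String.Format("{0,-20} {1}", s, String.Join(", ",
                m.Where(x => !String.IsNullOrEmpty(x)))));
        }
    }
}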
Output:
*\\*                 *, \\, *
*\\\**               *, \\\*, *
*\**                 *, \*, *
test\**test2         test\*, *, test2
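Since the question says a solution does not have to involve a regex, the same rule (a backslash always consumes the character after it) could also be applied with a single manual pass over the string. The following is only a minimal sketch under that assumption; the class name StarSplitter and the method SplitOnUnescapedStar are hypothetical and do not come from either answer. Empty pieces are dropped, as in the filtered output above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StarSplitter
{
    // Hypothetical helper: splits on unescaped '*', keeping each '*' as its own item.
    public static List<string> SplitOnUnescapedStar(string input)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '\\' && i + 1 < input.Length)
            {
                // A backslash escapes the next character: copy both unchanged.
                current.Append(c).Append(input[i + 1]);
                i++;
            }
            else if (c == '*')
            {
                // Unescaped separator: flush pending text, then emit the '*'.
                if (current.Length > 0)
                {
                    parts.Add(current.ToString());
                    current.Clear();
                }
                parts.Add("*");
            }
            else
            {
                // Ordinary character (or a lone trailing backslash).
                current.Append(c);
            }
        }
        if (current.Length > 0)
        {
            parts.Add(current.ToString());
        }
        return parts;
    }

    public static void Main()
    {
        string[] tests = { @"*\\*", @"*\\\**", @"*\**", @"test\**test2" };
        foreach (string s in tests)
        {
            Console.WriteLine(string.Join(", ", SplitOnUnescapedStar(s)));
        }
    }
}
```

Because the scan consumes escape pairs from left to right, it never has to look backwards to count backslashes, which is the part the negative lookbehind in the question gets wrong.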