
Regex - Escape escape characters

Tags: c#, regex

My problem is quite complex, but can be boiled down to a simple example.

I am writing a custom query language where users can input strings which I parse to LINQ Expressions.

What I would like to be able to do is split strings on the * character, unless it is correctly escaped.

Input         Output                          Query Description
"*\\*"    --> { "*", "\\", "*" }       -- contains a '\'
"*\\\**"  --> { "*", "\\\*", "*" }     -- contains '\*'
"*\**"    --> { "*", "\*", "*" }       -- contains '*' (works now)

I don't mind Regex.Split returning empty strings, but I end up with this:

Regex.Split(@"*\\*", @"(?<!\\)(\*)")  --> {"", "*", "\\*"}

As you can see, I have tried with negative lookbehind, which works for all my cases except this one. I have also tried Regex.Escape, but with no luck.

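Obviously, my problem is that the lookbehind rejects any * preceded by a \, and that also rejects the * in \\*. But in this case, \\ is itself an escape sequence (an escaped backslash), so the * that follows it is not escaped.

A solution doesn't necessarily have to involve a regex.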
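
To make the failure concrete, here is a minimal sketch that lists the positions the lookbehind pattern accepts as separators. The class and method names (LookbehindDemo, SeparatorIndexes) are made up for illustration, and the capture group from the original pattern is dropped because only the match positions matter here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Returns the indexes at which the question's pattern finds a separator.
    // (Hypothetical helper, not part of the original code.)
    public static int[] SeparatorIndexes(string input)
    {
        MatchCollection matches = Regex.Matches(input, @"(?<!\\)\*");
        var result = new int[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Index;
        return result;
    }

    public static void Main()
    {
        // In *\\* the last * follows an escaped backslash and should split,
        // but the lookbehind only sees the single preceding '\' and rejects it.
        Console.WriteLine(string.Join(", ", SeparatorIndexes(@"*\\*")));  // prints: 0

        // In *\** the middle * really is escaped, and the pattern agrees.
        Console.WriteLine(string.Join(", ", SeparatorIndexes(@"*\**")));  // prints: 0, 3
    }
}
```

A single-character lookbehind cannot tell whether the backslash it sees is itself escaped. That requires looking at the whole run of backslashes before the *.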

Troels Larsen asked Jul 11 '14 09:07



2 Answers

I think it's much easier to match than to split, especially since you are not removing anything from the initial string. So what to match? Everything except an unescaped *.

How to do that? With the below regex:

@"(?:[^*\\]+|\\.)+|\*"

(?:[^*\\]+|\\.)+ matches a run made of characters that are neither * nor \, and of escaped characters (a backslash followed by any character). No lookaround is needed.

\* will match the separator.
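
The same pattern can be written with inline comments using RegexOptions.IgnorePatternWhitespace, which may make the two alternatives easier to follow. This is only a sketch: the names TokenizerDemo and Tokenize are invented for illustration, and the pattern is meant to be the one above, merely spread over several lines.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizerDemo
{
    // Same pattern as above; whitespace and #-comments are ignored by the engine.
    private static readonly Regex Token = new Regex(@"
        (?:            # a literal chunk, made of one or more of:
            [^*\\]+    #   a run of characters that are neither star nor backslash
          | \\.        #   a backslash plus the character it escapes
        )+
      | \*             # or a single unescaped star, the separator
    ", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns every token in order, separators included.
    public static string[] Tokenize(string input)
    {
        return Token.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", Tokenize(@"*\\\**")));  // prints: * | \\\* | *
    }
}
```

One caveat with this approach: a lone backslash at the very end of the input has no character to pair with, so neither alternative matches it and it is silently left out of the results. By default . also does not match a newline, so a backslash followed by \n behaves the same way unless RegexOptions.Singleline is added.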

In code:

using System;
using System.Text.RegularExpressions;
using System.Linq;
public class Test
{
    public static void Main()
    {   
        string[] tests = new string[]{
            @"*\\*",
            @"*\\\**",
            @"*\**",
        };

        Regex re = new Regex(@"(?:[^*\\]+|\\.)+|\*");

        foreach (string s in tests) {
            var parts = re.Matches(s)
             .OfType<Match>()
             .Select(m => m.Value)
             .ToList();

            Console.WriteLine(string.Join(", ", parts.ToArray()));
        }
    }
}

Output:

*, \\, *
*, \\\*, *
*, \*, *

ideone demo
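
Since the question allows non-regex solutions, the same match-instead-of-split idea can also be sketched as a hand-written scanner. This is an illustrative alternative, not the code from the answer above; ManualTokenizer and Tokenize are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ManualTokenizer
{
    // Splits on unescaped '*', keeping separators and escape sequences intact.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        var chunk = new StringBuilder();

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '\\' && i + 1 < input.Length)
            {
                // Keep the backslash and the character it escapes together.
                chunk.Append(c).Append(input[++i]);
            }
            else if (c == '*')
            {
                // Unescaped separator: flush the pending chunk, then emit "*".
                if (chunk.Length > 0)
                {
                    tokens.Add(chunk.ToString());
                    chunk.Clear();
                }
                tokens.Add("*");
            }
            else
            {
                // Ordinary character (or a lone trailing backslash, kept as-is).
                chunk.Append(c);
            }
        }

        if (chunk.Length > 0)
            tokens.Add(chunk.ToString());
        return tokens;
    }

    public static void Main()
    {
        foreach (string s in new[] { @"*\\*", @"*\\\**", @"*\**" })
            Console.WriteLine(string.Join(", ", Tokenize(s)));
    }
}
```

For the three test strings this prints the same three lines as the regex version. Unlike the regex, this sketch keeps a lone trailing backslash as part of the last chunk instead of dropping it; whether that is the desired behavior depends on the query language.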

Jerry answered Oct 23 '22 09:10


I came up with this regex: (?<=(?:^|[^\\])(?:\\\\)*)(\*)

Explanation:

The lookbehind whitelists what may appear immediately before a separator *:

  • start of the string ^
  • not \ - [^\\]
  • (not \ or beginning of the string) and then even number of \ - (^|[^\\])(\\\\)*
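
Put differently, a * is a separator exactly when it is preceded by an even number of backslashes (zero included). The sketch below checks that reading by comparing the regex against a manual backslash count. The names EvenBackslashCheck, IsSeparator, ManualIndexes and RegexIndexes are invented for illustration, and the capture group is omitted since only positions are compared.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EvenBackslashCheck
{
    private static readonly Regex Separator =
        new Regex(@"(?<=(?:^|[^\\])(?:\\\\)*)\*");

    // True when s[index] is '*' preceded by an even number (possibly zero)
    // of consecutive backslashes.
    public static bool IsSeparator(string s, int index)
    {
        if (s[index] != '*') return false;
        int backslashes = 0;
        for (int i = index - 1; i >= 0 && s[i] == '\\'; i--)
            backslashes++;
        return backslashes % 2 == 0;
    }

    // Separator positions found by counting backslashes by hand.
    public static int[] ManualIndexes(string s) =>
        Enumerable.Range(0, s.Length).Where(i => IsSeparator(s, i)).ToArray();

    // Separator positions found by the lookbehind regex.
    public static int[] RegexIndexes(string s) =>
        Separator.Matches(s).Cast<Match>().Select(m => m.Index).ToArray();

    public static void Main()
    {
        foreach (string s in new[] { @"*\\*", @"*\\\**", @"test\**test2" })
            Console.WriteLine("{0,-14} manual: [{1}]  regex: [{2}]", s,
                string.Join(", ", ManualIndexes(s)),
                string.Join(", ", RegexIndexes(s)));
    }
}
```

Both columns should agree for every input, which is a quick way to sanity-check the pattern against new test strings.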

Test code and examples:

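using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] tests = new string[]{
    @"*\\*",
    @"*\\\**",
    @"*\**",
    @"test\**test2",
};

// Split on * only when it is preceded by an even number of backslashes.
// The capture group makes Regex.Split keep the separators in the result.
Regex re = new Regex(@"(?<=(?:^|[^\\])(?:\\\\)*)(\*)");

foreach (string s in tests) {
    string[] m = re.Split( s );
    // Drop the empty strings Split produces around leading/trailing separators.
    Console.WriteLine(String.Format("{0,-20} {1}", s, String.Join(", ",
       m.Where(x => !String.IsNullOrEmpty(x)))));
}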

Result:

*\\*                 *, \\, *
*\\\**               *, \\\*, *
*\**                 *, \*, *
test\**test2         test\*, *, test2

Vyktor answered Oct 23 '22 09:10