Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex for managing escaped characters for items like string literals

Tags:

python

regex

I would like to be able to match a string literal with the option of escaped quotations. For instance, I'd like to be able to search "this is a 'test with escaped\' values' ok" and have it properly recognize the backslash as an escape character. I've tried solutions like the following:

import re
regexc = re.compile(r"\'(.*?)(?<!\\)\'")
match = regexc.search(r""" Example: 'Foo \' Bar'  End. """)
print match.groups() 
# I want ("Foo \' Bar") to be printed above

After looking at this, there is a simple problem that the escape character being used, "\", can't be escaped itself. I can't figure out how to do that. I wanted a solution like the following, but negative lookbehind assertions need to be fixed length:

# ...
re.compile(r"\'(.*?)(?<!\\(\\\\)*)\'")
# ...

Any regex gurus able to tackle this problem? Thanks.

like image 986
Evan Fosmark Avatar asked Jan 10 '09 08:01

Evan Fosmark


4 Answers

re_single_quote = r"'[^'\\]*(?:\\.[^'\\]*)*'"

First note that MizardX's answer is 100% accurate. I'd just like to add some additional recommendations regarding efficiency. Secondly, I'd like to note that this problem was solved and optimized long ago - See: Mastering Regular Expressions (3rd Edition), (which covers this specific problem in great detail - highly recommended).

First let's look at the sub-expression to match a single quoted string which may contain escaped single quotes. If you are going to allow escaped single quotes, you had better at least allow escaped-escapes as well (which is what Douglas Leeder's answer does). But as long as you're at it, its just as easy to allow escaped-anything-else. With these requirements. MizardX is the only one who got the expression right. Here it is in both short and long format (and I've taken the liberty to write this in VERBOSE mode, with lots of descriptive comments - which you should always do for non-trivial regexes):

# MizardX's correct regex to match single quoted string:
re_sq_short = r"'((?:\\.|[^\\'])*)'"
re_sq_long = r"""
    '           # Literal opening quote
    (           # Capture group $1: Contents.
      (?:       # Group for contents alternatives
        \\.     # Either escaped anything
      | [^\\']  # or one non-quote, non-escape.
      )*        # Zero or more contents alternatives.
    )           # End $1: Contents.
    '
    """

This works and correctly matches all the following string test cases:

text01 = r"out1 'escaped-escape:        \\ ' out2"
test02 = r"out1 'escaped-quote:         \' ' out2"
test03 = r"out1 'escaped-anything:      \X ' out2"
test04 = r"out1 'two escaped escapes: \\\\ ' out2"
test05 = r"out1 'escaped-quote at end:   \'' out2"
test06 = r"out1 'escaped-escape at end:  \\' out2"

Ok, now lets begin to improve on this. First, the order of the alternatives makes a difference and one should always put the most likely alternative first. In this case, non escaped characters are more likely than escaped ones, so reversing the order will improve the regex's efficiency slightly like so:

# Better regex to match single quoted string:
re_sq_short = r"'((?:[^\\']|\\.)*)'"
re_sq_long = r"""
    '           # Literal opening quote
    (           # $1: Contents.
      (?:       # Group for contents alternatives
        [^\\']  # Either a non-quote, non-escape,
      | \\.     # or an escaped anything.
      )*        # Zero or more contents alternatives.
    )           # End $1: Contents.
    '
    """

"Unrolling-the-Loop":

This is a little better, but can be further improved (significantly) by applying Jeffrey Friedl's "unrolling-the-loop" efficiency technique (from MRE3). The above regex is not optimal because it must painstakingly apply the star quantifier to the non-capture group of two alternatives, each of which consume only one or two characters at a time. This alternation can be eliminated entirely by recognizing that a similar pattern is repeated over and over, and that an equivalent expression can be crafted to do the same thing without alternation. Here is an optimized expression to match a single quoted string and capture its contents into group $1:

# Better regex to match single quoted string:
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
re_sq_long = r"""
    '            # Literal opening quote
    (            # $1: Contents.
      [^'\\]*    # {normal*} Zero or more non-', non-escapes.
      (?:        # Group for {(special normal*)*} construct.
        \\.      # {special} Escaped anything.
        [^'\\]*  # More {normal*}.
      )*         # Finish up {(special normal*)*} construct.
    )            # End $1: Contents.
    '
    """

This expression gobbles up all non-quote, non-backslashes (the vast majority of most strings), in one "gulp", which drastically reduces the amount of work that the regex engine must perform. How much better you ask? Well, I entered each of the regexes presented from this question into RegexBuddy and measured how many steps it took the regex engine to complete a match on the following string (which all solutions correctly match):

'This is an example string which contains one \'internally quoted\' string.'

Here are the benchmark results on the above test string:

r"""
AUTHOR            SINGLE-QUOTE REGEX   STEPS TO: MATCH  NON-MATCH
Evan Fosmark      '(.*?)(?<!\\)'                  374     376
Douglas Leeder    '(([^\\']|\\'|\\\\)*)'          154     444
cletus/PEZ        '((?:\\'|[^'])*)(?<!\\)'        223     527
MizardX           '((?:\\.|[^\\'])*)'             221     369
MizardX(improved) '((?:[^\\']|\\.)*)'             153     369
Jeffrey Friedl    '([^\\']*(?:\\.[^\\']*)*)'       13      19
"""

These steps are the number of steps required to match the test string using the RegexBuddy debugger function. The "NON-MATCH" column is the number of steps required to declare match failure when the closing quote is removed from the test string. As you can see, the difference is significant for both the matching and non-matching cases. Note also that these efficiency improvements are only applicable to a NFA engine which uses backtracking (i.e. Perl, PHP, Java, Python, Javascript, .NET, Ruby and most others.) A DFA engine will not see any performance boost by this technique (See: Regular Expression Matching Can Be Simple And Fast).

On to the complete solution:

The goal of the original question (my interpretation), is to pick out single quoted sub-strings (which may contain escaped quotes) from a larger string. If it is known that the text outside the quoted sub-strings will never contain escaped-single-quotes, the regex above will do the job. However, to correctly match single-quoted sub-strings within a sea of text swimming with escaped-quotes and escaped-escapes and escaped-anything-elses, (which is my interpretation of what the author is after), requires parsing from the beginning of the string No, (this is what I originally thought), but it doesn't - this can be achieved using MizardX's very clever (?<!\\)(?:\\\\)* expression. Here are some test strings to exercise the various solutions:

text01 = r"out1 'escaped-escape:        \\ ' out2"
test02 = r"out1 'escaped-quote:         \' ' out2"
test03 = r"out1 'escaped-anything:      \X ' out2"
test04 = r"out1 'two escaped escapes: \\\\ ' out2"
test05 = r"out1 'escaped-quote at end:   \'' out2"
test06 = r"out1 'escaped-escape at end:  \\' out2"
test07 = r"out1           'str1' out2 'str2' out2"
test08 = r"out1 \'        'str1' out2 'str2' out2"
test09 = r"out1 \\\'      'str1' out2 'str2' out2"
test10 = r"out1 \\        'str1' out2 'str2' out2"
test11 = r"out1 \\\\      'str1' out2 'str2' out2"
test12 = r"out1         \\'str1' out2 'str2' out2"
test13 = r"out1       \\\\'str1' out2 'str2' out2"
test14 = r"out1           'str1''str2''str3' out2"

Given this test data let's see how the various solutions fare ('p'==pass, 'XX'==fail):

r"""
AUTHOR/REGEX     01  02  03  04  05  06  07  08  09  10  11  12  13  14
Douglas Leeder    p   p  XX   p   p   p   p   p   p   p   p  XX  XX  XX
  r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'"
cletus/PEZ        p   p   p   p   p  XX   p   p   p   p   p  XX  XX  XX
  r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'"
MizardX           p   p   p   p   p   p   p   p   p   p   p   p   p   p
  r"(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'"
ridgerunner       p   p   p   p   p   p   p   p   p   p   p   p   p   p
  r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'"
"""

A working test script:

import re
data_list = [
    r"out1 'escaped-escape:        \\ ' out2",
    r"out1 'escaped-quote:         \' ' out2",
    r"out1 'escaped-anything:      \X ' out2",
    r"out1 'two escaped escapes: \\\\ ' out2",
    r"out1 'escaped-quote at end:   \'' out2",
    r"out1 'escaped-escape at end:  \\' out2",
    r"out1           'str1' out2 'str2' out2",
    r"out1 \'        'str1' out2 'str2' out2",
    r"out1 \\\'      'str1' out2 'str2' out2",
    r"out1 \\        'str1' out2 'str2' out2",
    r"out1 \\\\      'str1' out2 'str2' out2",
    r"out1         \\'str1' out2 'str2' out2",
    r"out1       \\\\'str1' out2 'str2' out2",
    r"out1           'str1''str2''str3' out2",
    ]

regex = re.compile(
    r"""(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'""",
    re.DOTALL)

data_cnt = 0
for data in data_list:
    data_cnt += 1
    print ("\nData string %d" % (data_cnt))
    m_cnt = 0
    for match in regex.finditer(data):
        m_cnt += 1
        if (match.group(1)):
            print("  quoted sub-string%3d = \"%s\"" %
                (m_cnt, match.group(1)))

Phew!

p.s. Thanks to MizardX for the very cool (?<!\\)(?:\\\\)* expression. Learn something new every day!

like image 77
ridgerunner Avatar answered Oct 15 '22 00:10

ridgerunner


I think this will work:

import re
regexc = re.compile(r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'")

def check(test, base, target):
    match = regexc.search(base)
    assert match is not None, test+": regex didn't match for "+base
    assert match.group(1) == target, test+": "+target+" not found in "+base
    print "test %s passed"%test

check("Empty","''","")
check("single escape1", r""" Example: 'Foo \' Bar'  End. """,r"Foo \' Bar")
check("single escape2", r"""'\''""",r"\'")
check("double escape",r""" Example2: 'Foo \\' End. """,r"Foo \\")
check("First quote escaped",r"not matched\''a'","a")
check("First quote escaped beginning",r"\''a'","a")

The regular expression r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'" is forward matching only the things that we want inside the string:

  1. Chars that aren't backslash or quote.
  2. Escaped quote
  3. Escaped backslash

EDIT:

Add extra regex at front to check for first quote escaped.

like image 22
Douglas Leeder Avatar answered Oct 15 '22 00:10

Douglas Leeder


Douglas Leeder's pattern ((?:^|[^\\])'(([^\\']|\\'|\\\\)*)') will fail to match "test 'test \x3F test' test" and "test \\'test' test". (String containing an escape other than quote and backslash, and string preceded by an escaped backslash.)

cletus' pattern ((?<!\\)'((?:\\'|[^'])*)(?<!\\)') will fail to match "test 'test\\' test". (String ending with an escaped backslash.)

My proposal for single-quoted strings is this:

(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'

For both single-quoted or double-quoted stings, you could use this:

(?<!\\)(?:\\\\)*("|')((?:\\.|(?!\1)[^\\])*)\1

Test run using Python:

Doublas Leeder´s test cases:
"''" matched successfully: ""
" Example: 'Foo \' Bar'  End. " matched successfully: "Foo \' Bar"
"'\''" matched successfully: "\'"
" Example2: 'Foo \\' End. " matched successfully: "Foo \\"
"not matched\''a'" matched successfully: "a"
"\''a'" matched successfully: "a"

cletus´ test cases:
"'testing 123'" matched successfully: "testing 123"
"'testing 123\\'" matched successfully: "testing 123\\"
"'testing 123" didn´t match, as exected.
"blah 'testing 123" didn´t match, as exected.
"blah 'testing 123'" matched successfully: "testing 123"
"blah 'testing 123' foo" matched successfully: "testing 123"
"this 'is a \' test'" matched successfully: "is a \' test"
"another \' test 'testing \' 123' \' blah" matched successfully: "testing \' 123"

MizardX´s test cases:
"test 'test \x3F test' test" matched successfully: "test \x3F test"
"test \\'test' test" matched successfully: "test"
"test 'test\\' test" matched successfully: "test\\"
like image 33
Markus Jarderot Avatar answered Oct 15 '22 00:10

Markus Jarderot


If I understand what you're saying (and I'm not sure I do) you want to find the quoted string within your string ignoring escaped quotes. Is that right? If so, try this:

/(?<!\\)'((?:\\'|[^'])*)(?<!\\)'/

Basically:

  • Start with a single quote that isn't preceded by a backslash;
  • Match zero or more occurrences of: backslash then quote or any character other than a quote;
  • End in a quote;
  • Don't group the middle parentheses (the ?: operator); and
  • The closing quote can't be preceded by a backslash.

Ok, I've tested this in Java (sorry that's more my schtick than Python but the principle is the same):

private final static String TESTS[] = {
        "'testing 123'",
        "'testing 123\\'",
        "'testing 123",
        "blah 'testing 123",
        "blah 'testing 123'",
        "blah 'testing 123' foo",
        "this 'is a \\' test'",
        "another \\' test 'testing \\' 123' \\' blah"
};

public static void main(String args[]) {
    Pattern p = Pattern.compile("(?<!\\\\)'((?:\\\\'|[^'])*)(?<!\\\\)'");
    for (String test : TESTS) {
        Matcher m = p.matcher(test);
        if (m.find()) {
            System.out.printf("%s => %s%n", test, m.group(1));
        } else {
            System.out.printf("%s doesn't match%n", test);
        }
    }
}

results:

'testing 123' => testing 123
'testing 123\' doesn't match
'testing 123 doesn't match
blah 'testing 123 doesn't match
blah 'testing 123' => testing 123
blah 'testing 123' foo => testing 123
this 'is a \' test' => is a \' test
another \' test 'testing \' 123' \' blah => testing \' 123

which seems correct.

like image 27
cletus Avatar answered Oct 15 '22 00:10

cletus