I would like to be able to match a string literal with the option of escaped quotations. For instance, I'd like to be able to search "this is a 'test with escaped\' values' ok" and have it properly recognize the backslash as an escape character. I've tried solutions like the following:
import re
regexc = re.compile(r"\'(.*?)(?<!\\)\'")
match = regexc.search(r""" Example: 'Foo \' Bar' End. """)
print match.groups()
# I want ("Foo \' Bar") to be printed above
After looking at this, there is a simple problem that the escape character being used, "\
", can't be escaped itself. I can't figure out how to do that. I wanted a solution like the following, but negative lookbehind assertions need to be fixed length:
# ...
re.compile(r"\'(.*?)(?<!\\(\\\\)*)\'")
# ...
Any regex gurus able to tackle this problem? Thanks.
'[^'\\]*(?:\\.[^'\\]*)*'"
First note that MizardX's answer is 100% accurate. I'd just like to add some additional recommendations regarding efficiency. Secondly, I'd like to note that this problem was solved and optimized long ago - See: Mastering Regular Expressions (3rd Edition), (which covers this specific problem in great detail - highly recommended).
First let's look at the sub-expression to match a single quoted string which may contain escaped single quotes. If you are going to allow escaped single quotes, you had better at least allow escaped-escapes as well (which is what Douglas Leeder's answer does). But as long as you're at it, its just as easy to allow escaped-anything-else. With these requirements. MizardX is the only one who got the expression right. Here it is in both short and long format (and I've taken the liberty to write this in VERBOSE
mode, with lots of descriptive comments - which you should always do for non-trivial regexes):
# MizardX's correct regex to match single quoted string:
re_sq_short = r"'((?:\\.|[^\\'])*)'"
re_sq_long = r"""
' # Literal opening quote
( # Capture group $1: Contents.
(?: # Group for contents alternatives
\\. # Either escaped anything
| [^\\'] # or one non-quote, non-escape.
)* # Zero or more contents alternatives.
) # End $1: Contents.
'
"""
This works and correctly matches all the following string test cases:
text01 = r"out1 'escaped-escape: \\ ' out2"
test02 = r"out1 'escaped-quote: \' ' out2"
test03 = r"out1 'escaped-anything: \X ' out2"
test04 = r"out1 'two escaped escapes: \\\\ ' out2"
test05 = r"out1 'escaped-quote at end: \'' out2"
test06 = r"out1 'escaped-escape at end: \\' out2"
Ok, now lets begin to improve on this. First, the order of the alternatives makes a difference and one should always put the most likely alternative first. In this case, non escaped characters are more likely than escaped ones, so reversing the order will improve the regex's efficiency slightly like so:
# Better regex to match single quoted string:
re_sq_short = r"'((?:[^\\']|\\.)*)'"
re_sq_long = r"""
' # Literal opening quote
( # $1: Contents.
(?: # Group for contents alternatives
[^\\'] # Either a non-quote, non-escape,
| \\. # or an escaped anything.
)* # Zero or more contents alternatives.
) # End $1: Contents.
'
"""
This is a little better, but can be further improved (significantly) by applying Jeffrey Friedl's "unrolling-the-loop" efficiency technique (from MRE3). The above regex is not optimal because it must painstakingly apply the star quantifier to the non-capture group of two alternatives, each of which consume only one or two characters at a time. This alternation can be eliminated entirely by recognizing that a similar pattern is repeated over and over, and that an equivalent expression can be crafted to do the same thing without alternation. Here is an optimized expression to match a single quoted string and capture its contents into group $1
:
# Better regex to match single quoted string:
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
re_sq_long = r"""
' # Literal opening quote
( # $1: Contents.
[^'\\]* # {normal*} Zero or more non-', non-escapes.
(?: # Group for {(special normal*)*} construct.
\\. # {special} Escaped anything.
[^'\\]* # More {normal*}.
)* # Finish up {(special normal*)*} construct.
) # End $1: Contents.
'
"""
This expression gobbles up all non-quote, non-backslashes (the vast majority of most strings), in one "gulp", which drastically reduces the amount of work that the regex engine must perform. How much better you ask? Well, I entered each of the regexes presented from this question into RegexBuddy and measured how many steps it took the regex engine to complete a match on the following string (which all solutions correctly match):
'This is an example string which contains one \'internally quoted\' string.'
Here are the benchmark results on the above test string:
r"""
AUTHOR SINGLE-QUOTE REGEX STEPS TO: MATCH NON-MATCH
Evan Fosmark '(.*?)(?<!\\)' 374 376
Douglas Leeder '(([^\\']|\\'|\\\\)*)' 154 444
cletus/PEZ '((?:\\'|[^'])*)(?<!\\)' 223 527
MizardX '((?:\\.|[^\\'])*)' 221 369
MizardX(improved) '((?:[^\\']|\\.)*)' 153 369
Jeffrey Friedl '([^\\']*(?:\\.[^\\']*)*)' 13 19
"""
These steps are the number of steps required to match the test string using the RegexBuddy debugger function. The "NON-MATCH" column is the number of steps required to declare match failure when the closing quote is removed from the test string. As you can see, the difference is significant for both the matching and non-matching cases. Note also that these efficiency improvements are only applicable to a NFA engine which uses backtracking (i.e. Perl, PHP, Java, Python, Javascript, .NET, Ruby and most others.) A DFA engine will not see any performance boost by this technique (See: Regular Expression Matching Can Be Simple And Fast).
The goal of the original question (my interpretation), is to pick out single quoted sub-strings (which may contain escaped quotes) from a larger string. If it is known that the text outside the quoted sub-strings will never contain escaped-single-quotes, the regex above will do the job. However, to correctly match single-quoted sub-strings within a sea of text swimming with escaped-quotes and escaped-escapes and escaped-anything-elses, (which is my interpretation of what the author is after), requires parsing from the beginning of the string No, (this is what I originally thought), but it doesn't - this can be achieved using MizardX's very clever (?<!\\)(?:\\\\)*
expression. Here are some test strings to exercise the various solutions:
text01 = r"out1 'escaped-escape: \\ ' out2"
test02 = r"out1 'escaped-quote: \' ' out2"
test03 = r"out1 'escaped-anything: \X ' out2"
test04 = r"out1 'two escaped escapes: \\\\ ' out2"
test05 = r"out1 'escaped-quote at end: \'' out2"
test06 = r"out1 'escaped-escape at end: \\' out2"
test07 = r"out1 'str1' out2 'str2' out2"
test08 = r"out1 \' 'str1' out2 'str2' out2"
test09 = r"out1 \\\' 'str1' out2 'str2' out2"
test10 = r"out1 \\ 'str1' out2 'str2' out2"
test11 = r"out1 \\\\ 'str1' out2 'str2' out2"
test12 = r"out1 \\'str1' out2 'str2' out2"
test13 = r"out1 \\\\'str1' out2 'str2' out2"
test14 = r"out1 'str1''str2''str3' out2"
Given this test data let's see how the various solutions fare ('p'==pass, 'XX'==fail):
r"""
AUTHOR/REGEX 01 02 03 04 05 06 07 08 09 10 11 12 13 14
Douglas Leeder p p XX p p p p p p p p XX XX XX
r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'"
cletus/PEZ p p p p p XX p p p p p XX XX XX
r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'"
MizardX p p p p p p p p p p p p p p
r"(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'"
ridgerunner p p p p p p p p p p p p p p
r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'"
"""
import re
data_list = [
r"out1 'escaped-escape: \\ ' out2",
r"out1 'escaped-quote: \' ' out2",
r"out1 'escaped-anything: \X ' out2",
r"out1 'two escaped escapes: \\\\ ' out2",
r"out1 'escaped-quote at end: \'' out2",
r"out1 'escaped-escape at end: \\' out2",
r"out1 'str1' out2 'str2' out2",
r"out1 \' 'str1' out2 'str2' out2",
r"out1 \\\' 'str1' out2 'str2' out2",
r"out1 \\ 'str1' out2 'str2' out2",
r"out1 \\\\ 'str1' out2 'str2' out2",
r"out1 \\'str1' out2 'str2' out2",
r"out1 \\\\'str1' out2 'str2' out2",
r"out1 'str1''str2''str3' out2",
]
regex = re.compile(
r"""(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'""",
re.DOTALL)
data_cnt = 0
for data in data_list:
data_cnt += 1
print ("\nData string %d" % (data_cnt))
m_cnt = 0
for match in regex.finditer(data):
m_cnt += 1
if (match.group(1)):
print(" quoted sub-string%3d = \"%s\"" %
(m_cnt, match.group(1)))
p.s. Thanks to MizardX for the very cool (?<!\\)(?:\\\\)*
expression. Learn something new every day!
I think this will work:
import re
regexc = re.compile(r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'")
def check(test, base, target):
match = regexc.search(base)
assert match is not None, test+": regex didn't match for "+base
assert match.group(1) == target, test+": "+target+" not found in "+base
print "test %s passed"%test
check("Empty","''","")
check("single escape1", r""" Example: 'Foo \' Bar' End. """,r"Foo \' Bar")
check("single escape2", r"""'\''""",r"\'")
check("double escape",r""" Example2: 'Foo \\' End. """,r"Foo \\")
check("First quote escaped",r"not matched\''a'","a")
check("First quote escaped beginning",r"\''a'","a")
The regular expression r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'"
is forward matching only the things that we want inside the string:
EDIT:
Add extra regex at front to check for first quote escaped.
Douglas Leeder's pattern ((?:^|[^\\])'(([^\\']|\\'|\\\\)*)'
) will fail to match "test 'test \x3F test' test"
and "test \\'test' test"
. (String containing an escape other than quote and backslash, and string preceded by an escaped backslash.)
cletus' pattern ((?<!\\)'((?:\\'|[^'])*)(?<!\\)'
) will fail to match "test 'test\\' test"
. (String ending with an escaped backslash.)
My proposal for single-quoted strings is this:
(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'
For both single-quoted or double-quoted stings, you could use this:
(?<!\\)(?:\\\\)*("|')((?:\\.|(?!\1)[^\\])*)\1
Test run using Python:
Doublas Leeder´s test cases:
"''" matched successfully: ""
" Example: 'Foo \' Bar' End. " matched successfully: "Foo \' Bar"
"'\''" matched successfully: "\'"
" Example2: 'Foo \\' End. " matched successfully: "Foo \\"
"not matched\''a'" matched successfully: "a"
"\''a'" matched successfully: "a"
cletus´ test cases:
"'testing 123'" matched successfully: "testing 123"
"'testing 123\\'" matched successfully: "testing 123\\"
"'testing 123" didn´t match, as exected.
"blah 'testing 123" didn´t match, as exected.
"blah 'testing 123'" matched successfully: "testing 123"
"blah 'testing 123' foo" matched successfully: "testing 123"
"this 'is a \' test'" matched successfully: "is a \' test"
"another \' test 'testing \' 123' \' blah" matched successfully: "testing \' 123"
MizardX´s test cases:
"test 'test \x3F test' test" matched successfully: "test \x3F test"
"test \\'test' test" matched successfully: "test"
"test 'test\\' test" matched successfully: "test\\"
If I understand what you're saying (and I'm not sure I do) you want to find the quoted string within your string ignoring escaped quotes. Is that right? If so, try this:
/(?<!\\)'((?:\\'|[^'])*)(?<!\\)'/
Basically:
Ok, I've tested this in Java (sorry that's more my schtick than Python but the principle is the same):
private final static String TESTS[] = {
"'testing 123'",
"'testing 123\\'",
"'testing 123",
"blah 'testing 123",
"blah 'testing 123'",
"blah 'testing 123' foo",
"this 'is a \\' test'",
"another \\' test 'testing \\' 123' \\' blah"
};
public static void main(String args[]) {
Pattern p = Pattern.compile("(?<!\\\\)'((?:\\\\'|[^'])*)(?<!\\\\)'");
for (String test : TESTS) {
Matcher m = p.matcher(test);
if (m.find()) {
System.out.printf("%s => %s%n", test, m.group(1));
} else {
System.out.printf("%s doesn't match%n", test);
}
}
}
results:
'testing 123' => testing 123
'testing 123\' doesn't match
'testing 123 doesn't match
blah 'testing 123 doesn't match
blah 'testing 123' => testing 123
blah 'testing 123' foo => testing 123
this 'is a \' test' => is a \' test
another \' test 'testing \' 123' \' blah => testing \' 123
which seems correct.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With