I would like to parse an input string and determine if it contains a sequence of characters surrounded by double quotes (").
The sequence of characters itself is not allowed to contain further double quotes, unless they are escaped by a backslash, like so: \".
To make things more complicated, the backslashes can be escaped themselves, like so: \\. A double quote preceded by two (or any even number of) backslashes (\\") is therefore not escaped.
And to make it even worse, single non-escaping backslashes (i.e. followed by neither " nor \) are allowed.
I'm trying to solve that with Python's re module.
The module documentation tells us about the pipe operator A|B:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
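For illustration, here is a minimal sketch of that documented behavior. The toy pattern and input are made up for this example and are not part of the actual problem. It also shows that a branch which matched can still be abandoned later if the rest of the pattern fails.

```python
import re

# Toy pattern: the left branch 'a' is tried first and wins,
# even though the right branch 'ab' would give a longer match.
m = re.match(r'a|ab', 'ab')
print(m.group())  # a

# "B will not be tested" only holds while the overall match succeeds.
# Here '$' fails after 'a', so the engine backtracks into the
# alternation and tries the right branch as well.
m2 = re.match(r'(?:a|ab)$', 'ab')
print(m2.group())  # ab
```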
However, this doesn't work as I expected:
>>> import re
>>> re.match(r'"(\\[\\"]|[^"])*"', r'"a\"')
<_sre.SRE_Match object; span=(0, 4), match='"a\\"'>
The idea of this regex is to first check for an escaped character (\\ or \") and only if that's not found, check for any character that's not " (but it could be a single \).
This can occur an arbitrary number of times and it has to be surrounded by literal " characters.
I would expect the string "a\" not to match at all, but apparently it does.
I would expect \" to match the A part and the B part not to be tested, but apparently it is.
I don't really know how the backtracking works in this very case, but is there a way to avoid it?
I guess it would work if I check first for the initial " character (and remove it from the input) in a separate step.
I could then use the following regular expression to get the content of the string:
>>> re.match(r'(\\[\\"]|[^"])*', r'a\"')
<_sre.SRE_Match object; span=(0, 3), match='a\\"'>
This would include the escaped quote. Since there wouldn't be a closing quote left, I would know that overall, the given string does not match.
Do I have to do it like that or is it possible to solve this with a single regular expression and no additional manual checking?
In my real application, the "-enclosed string is only one part of a larger pattern, so I think it would be simpler to do it all at once in a single regular expression.
I found similar questions, but those don't consider that a single non-escaping backslash can be part of the string: regex to parse string with escaped characters, Parsing for escape characters with a regular expression.
When you use "(\\[\\"]|[^"])*", you match " followed by 0+ sequences of \ followed by either \ or ", or non-", and then followed by a "closing" ". Note that when your input is "a\", the \ is matched by the second alternative branch [^"] (as the backslash is a valid non-").
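One way to see this is to inspect the capturing group of the original pattern, which keeps the last repetition it matched. This is only a small sketch; the variable names are chosen for illustration.

```python
import re

# The pattern from the question, unchanged.
flawed = re.compile(r'"(\\[\\"]|[^"])*"')
m = flawed.match(r'"a\"')

print(m.group(0))  # "a\"
# Group 1 holds the last repetition of the group: a lone backslash.
# After the first attempt (\" as an escape) left no closing quote,
# the engine backtracked and let [^"] consume the backslash instead.
print(m.group(1))  # \
```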
You need to exclude the \ from the non-":
"(?:[^\\"]|\\.)*"
      ^^
So, we match ", then either non-" and non-\ (with [^\\"]) or any escape sequence (with \\.), 0 or more times.
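A quick check of this pattern against the cases from the question might look like the sketch below. The sample inputs are chosen for illustration.

```python
import re

fixed = re.compile(r'"(?:[^\\"]|\\.)*"')

# Unterminated: the only quote after 'a' is escaped, so there is no match.
print(fixed.match(r'"a\"'))            # None
# Escaped quote inside a properly closed string.
print(fixed.match(r'"a\"b"').group())  # "a\"b"
# Escaped backslash followed by the closing quote.
print(fixed.match(r'"a\\"').group())   # "a\\"
# A lone backslash before an ordinary character is consumed by \\. as well.
print(fixed.match(r'"a\b"').group())   # "a\b"
```

Note that a single non-escaping backslash is still accepted, because \\. matches a backslash followed by any character other than a newline.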
However, this regex is not very efficient, because the alternation inside a quantified group causes a lot of backtracking. An unrolled version is:
"[^"\\]*(?:\\.[^"\\]*)*"
See the regex demo
The last pattern matches:
" - a double quote
[^"\\]* - zero or more characters other than \ and "
(?:\\.[^"\\]*)* - zero or more sequences of
    \\. - a backslash followed by any character but a newline
    [^"\\]* - zero or more characters other than \ and "
" - a double quote
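As a rough sanity check, the sketch below compares the alternation version with the unrolled version on a few sample inputs, then embeds the unrolled pattern in a larger one. The key="value" format and the names `samples` and `assignment` are hypothetical, since the real surrounding pattern is not shown in the question.

```python
import re

alternation = re.compile(r'"(?:[^\\"]|\\.)*"')
unrolled = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

samples = [r'"a\"', r'"a\""', r'"a\\"', r'"a\b"', r'""', r'"a\\"b"']
for s in samples:
    a, u = alternation.match(s), unrolled.match(s)
    # Both patterns should agree on whether and what they match.
    assert (a and a.group()) == (u and u.group())

# Hypothetical larger pattern: key="value", with the quoted string as one part.
assignment = re.compile(r'(\w+)=(' + unrolled.pattern + r')')
m = assignment.fullmatch(r'name="say \"hi\""')
print(m.group(1))  # name
print(m.group(2))  # "say \"hi\""
```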