
How to scan for a string literal allowing escaped characters?

I would like to parse an input string and determine if it contains a sequence of characters surrounded by double quotes ("). The sequence of characters itself is not allowed to contain further double quotes, unless they are escaped by a backslash, like so: \".

To make things more complicated, the backslashes can be escaped themselves, like so: \\. A double quote preceded by two (or any even number of) backslashes (\\") is therefore not escaped. And to make it even worse, single non-escaping backslashes (i.e. followed by neither " nor \) are allowed.

I'm trying to solve that with Python's re module. The module documentation tells us about the pipe operator A|B:

As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.

However, this doesn't work as I expected:

>>> import re
>>> re.match(r'"(\\[\\"]|[^"])*"', r'"a\"')
<_sre.SRE_Match object; span=(0, 4), match='"a\\"'>

The idea of this regex is to first check for an escaped character (\\ or \") and only if that's not found, check for any character that's not " (but it could be a single \). This can occur an arbitrary number of times and it has to be surrounded by literal " characters.

I would expect the string "a\" not to match at all, but apparently it does. I would expect \" to match the A part and the B part not to be tested, but apparently it is.

I don't really know how the backtracking works in this particular case, but is there a way to avoid it?

I guess it would work if I check first for the initial " character (and remove it from the input) in a separate step. I could then use the following regular expression to get the content of the string:

>>> re.match(r'(\\[\\"]|[^"])*', r'a\"')
<_sre.SRE_Match object; span=(0, 3), match='a\\"'>

This would include the escaped quote. Since there wouldn't be a closing quote left, I would know that overall, the given string does not match.
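The two-step idea described above could be sketched roughly as follows. The helper name `is_quoted` is hypothetical, and the sketch assumes the whole input must be exactly one quoted string. Because nothing after the content pattern can fail inside the regex, the engine is never forced to backtrack into the second branch.

```python
import re

# Content pattern from the question, without the surrounding quotes.
content = re.compile(r'(?:\\[\\"]|[^"])*')

def is_quoted(s):
    """Hypothetical helper: True if s is exactly one "-enclosed string."""
    if not s.startswith('"'):
        return False
    # Match the content starting right after the opening quote (pos=1).
    m = content.match(s, 1)
    # Whatever is left must be exactly the closing quote.
    return s[m.end():] == '"'

print(is_quoted(r'"a\"'))   # escaped quote, nothing left to close the string
print(is_quoted(r'"a\\"'))  # escaped backslash, then a real closing quote
```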

Do I have to do it like that or is it possible to solve this with a single regular expression and no additional manual checking?

In my real application, the "-enclosed string is only one part of a larger pattern, so I think it would be simpler to do it all at once in a single regular expression.

I found similar questions, but those don't consider that a single non-escaping backslash can be part of the string: regex to parse string with escaped characters, Parsing for escape characters with a regular expression.

Matthias asked May 22 '16 18:05



1 Answer

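The pattern "(\\[\\"]|[^"])*" matches a ", then zero or more repetitions of either a \ followed by \ or ", or any single character other than ", and then a closing ". With the input "a\", the first branch does consume \" at first, but then no closing " is left, so the engine backtracks and lets the second branch [^"] match the \ on its own (a backslash is a valid non-" character). The final " is then free to act as the closing quote.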
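A small check that makes this backtracking visible, using the pattern and input from the question. The capturing group keeps only its last repetition, which here is the lone backslash taken by the second branch.

```python
import re

# Pattern from the question: escaped pair first, then any non-quote character.
pattern = re.compile(r'"(\\[\\"]|[^"])*"')

m = pattern.match(r'"a\"')
print(repr(m.group(0)))  # the whole input is matched
print(repr(m.group(1)))  # last repetition of the group: a single backslash
```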

You need to exclude the \ from the non-":

"(?:[^\\"]|\\.)*"
      ^^

So, we match ", then either non-" and non-\ (with [^\\"]) or any escape sequence (with \\.), 0 or more times.
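As a quick sanity check, the pattern can be run against a few inputs. The sample strings below are illustrative and not from the original question, apart from "a\". Note that without re.DOTALL the . in \\. does not match a newline.

```python
import re

# Corrected pattern: the backslash is excluded from the "ordinary" characters.
fixed = re.compile(r'"(?:[^\\"]|\\.)*"')

samples = [
    r'"a\"',    # escaped quote, no closing quote
    r'"a\\"',   # escaped backslash followed by the closing quote
    r'"a\"b"',  # escaped quote inside a properly closed string
    r'"a\b"',   # backslash in front of an ordinary character
]

# Map each sample to whether the pattern matches at the start of the string.
results = {s: bool(fixed.match(s)) for s in samples}
for s, ok in results.items():
    print(s, ok)
```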

However, this regex is not very efficient: the alternation inside a quantified group causes a lot of backtracking. An unrolled version is:

"[^"\\]*(?:\\.[^"\\]*)*"


The last pattern matches:

  • " - a double quote
  • [^"\\]* - zero or more characters other than \ and "
  • (?:\\.[^"\\]*)* - zero or more sequences of
    • \\. - a backslash followed by any character but a newline
    • [^"\\]* - zero or more characters other than \ and "
  • " - a double quote
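Since the question mentions that the quoted string is only one part of a larger input, here is a minimal sketch of the unrolled pattern picking quoted strings out of surrounding text. The sample text is made up for illustration.

```python
import re

# Unrolled pattern: runs of ordinary characters, interleaved with escape pairs.
unrolled = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

# Illustrative input containing two quoted strings with escapes.
text = r'say "hi \"there\"" and "c:\\" now'

found = unrolled.findall(text)
for item in found:
    print(item)
```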
Wiktor Stribiżew answered Sep 27 '22 23:09