I'm writing a lexer as part of a university course. One of the brain teasers (extra assignments that don't contribute to the scoring) our professor gave us is how we could implement comments inside string literals.
Our string literals start and end with an exclamation mark, e.g. !this is a string literal!
Our comments start and end with three periods, e.g. ...This is a comment...
Removing comments from string literals was relatively straightforward: match the string literal via /!.*!/ and remove the comment via regex. If there are three consecutive periods but no closing periods, throw an error.
However, I want to take this even further. I want to implement the escaping of the exclamation mark within the string literal. Unfortunately, I can't seem to get both comments and exclamation mark escapes working together.
What I want to create are string literals that can contain both comments and exclamation mark escapes. How could this be done?
Examples:
!Normal string!
!String with escaped \! exclamation mark!
!String with a comment ... comment ...!
!String \! with both ... comments can have unescaped exclamation marks!!!... !
This is my current code that can't ignore exclamation marks inside comments:
    import re

    def t_STRING_LITERAL(t):
        r'![^!\\]*(?:\\.[^!\\]*)*!'
        # remove the escape characters from the string
        t.value = re.sub(r'\\!', "!", t.value)
        # remove single line comments
        t.value = re.sub(r'\.\.\.[^\r\n]*\.\.\.', "", t.value)
        return t
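To make the failure concrete, here is a minimal sketch that compiles the rule's pattern on its own (outside the lexer, which is an assumption made only for testing) and runs it against the last example. The names CURRENT_PATTERN and source are illustrative.

```python
import re

# The pattern from t_STRING_LITERAL, compiled standalone for experimentation.
CURRENT_PATTERN = re.compile(r'![^!\\]*(?:\\.[^!\\]*)*!')

source = r'!String \! with both ... comments can have unescaped exclamation marks!!!... !'
match = CURRENT_PATTERN.match(source)

# The match stops at the first unescaped "!" inside the comment,
# so the token ends before the comment is closed.
print(match.group(0))
print(match.end() < len(source))
```

The pattern has no notion of a comment, so any unescaped exclamation mark ends the literal, including one inside ... ... .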
Perhaps this might be another option.
First match 0+ times any character except a backslash, dot or exclamation mark, using a negated character class.
Then, when you reach a character that this character class does not match, use an alternation to match either a single dot that is not followed by two more dots, a whole comment from ... to the nearest ..., or an escaped character.
To prevent catastrophic backtracking, you can mimic an atomic group in Python using a positive lookahead with a capturing group inside. If the assertion is true, use the backreference \1 to match what was captured.
For example
(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!
Explanation
(?<!\\)! : Match ! not directly preceded by \
[^!\\.]* : Match 0+ times any char except !, \ or .
(?: : Non-capturing group
(?:\.(?!\.\.) : Match a dot not directly followed by 2 dots
| : Or
(?=(\.{3}.*?\.{3}))\1 : Assert and capture in group 1 from ... to the nearest ..., then match it via the backreference
| : Or
\\. : Match an escaped char
) : Close the inner group
[^!\\.]* : Match 0+ times any char except !, \ or .
)*! : Close the non-capturing group and repeat 0+ times, then match !
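The pattern can be checked against the four examples from the question with a short script. This is only a sketch: the helper literal_value and the COMMENT_OR_ESCAPE pattern are not part of the answer. They are one assumed way to turn the matched text into a value, dropping comments and unescaping in a single pass so that an exclamation mark inside a comment is never treated as an escape target or a terminator. It also unescapes any backslash-escaped character, not only \!, which is an assumption.

```python
import re

# The regex from the answer, unchanged.
STRING_LITERAL = re.compile(
    r'(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!'
)

# Hypothetical helper pattern: either a whole comment, or a backslash escape
# whose escaped character is captured in group 1.
COMMENT_OR_ESCAPE = re.compile(r'\.{3}.*?\.{3}|\\(.)')


def literal_value(token_text):
    """Strip the delimiters, drop comments, and unescape escaped characters."""
    body = token_text[1:-1]
    # Group 1 is None for a comment, so the comment is replaced by "".
    return COMMENT_OR_ESCAPE.sub(lambda m: m.group(1) or "", body)


examples = [
    '!Normal string!',
    r'!String with escaped \! exclamation mark!',
    '!String with a comment ... comment ...!',
    r'!String \! with both ... comments can have unescaped exclamation marks!!!... !',
]

for text in examples:
    match = STRING_LITERAL.match(text)
    print(repr(literal_value(match.group(0))))
```

Doing both replacements in one pass avoids the ordering problem of two separate re.sub calls, where unescaping first could change what looks like a comment.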
Look at this regex to match string literals: https://regex101.com/r/v2bjWi/2
(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!
The literal is delimited by (?<!\\)!, meaning an unescaped exclamation mark. In between, it accepts escaped exclamation marks \\!, comments (?:\.\.\.(?P<comment>.*?)\.\.\.) and non-exclamation marks [^!].
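A quick way to try this pattern from Python is sketched below, using the examples from the question. The variable names are illustrative. One thing to keep in mind is that a named group inside a repeated group keeps only its last capture, so the comment group would hold just the final comment if a literal contained several.

```python
import re

# The regex from this answer, unchanged.
PATTERN = re.compile(
    r'(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!'
)

examples = [
    '!Normal string!',
    r'!String with escaped \! exclamation mark!',
    '!String with a comment ... comment ...!',
    r'!String \! with both ... comments can have unescaped exclamation marks!!!... !',
]

for text in examples:
    match = PATTERN.match(text)
    # Show whether the whole example was consumed and what the comment group captured.
    print(match.group(0) == text, repr(match.group('comment')))
```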
Note that this is about as much as you can achieve with a regular expression. Any additional request, and it will not be sufficient any more.
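If the requirements do grow, one alternative (a different technique from the regexes above, not something either answer proposes) is a small hand-written scanner. The sketch below is hypothetical: the function name scan_string_literal, its return shape, and the error messages are assumptions, and it assumes a comment may contain any characters, including newlines. It does make the "unterminated comment" error from the question easy to report.

```python
def scan_string_literal(src, start=0):
    """Scan a !...! literal beginning at src[start].

    Returns (value, end_index), where value has comments removed and
    escapes resolved, and end_index is the position after the closing "!".
    """
    if not src.startswith("!", start):
        raise ValueError("string literal must start with '!'")
    out = []
    i = start + 1
    n = len(src)
    while i < n:
        ch = src[i]
        if ch == "\\":
            # Escape: keep the next character literally.
            if i + 1 >= n:
                raise ValueError("dangling backslash")
            out.append(src[i + 1])
            i += 2
        elif src.startswith("...", i):
            # Comment: skip to just past the nearest closing "...".
            end = src.find("...", i + 3)
            if end == -1:
                raise ValueError("unterminated comment")
            i = end + 3
        elif ch == "!":
            # Unescaped "!" outside a comment closes the literal.
            return "".join(out), i + 1
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated string literal")


sample = r'!String \! with both ... comments can have unescaped exclamation marks!!!... !'
value, end = scan_string_literal(sample)
print(repr(value), end == len(sample))
```

Because each branch consumes input explicitly, adding further rules (other escape sequences, nested delimiters) means adding a branch rather than reworking a pattern.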