
How to ignore comments inside string literals

I'm writing a lexer as part of a university course. One of the brain teasers (extra assignments that don't contribute to the scoring) our professor gave us is how we could implement comments inside string literals.

Our string literals start and end with an exclamation mark, e.g. !this is a string literal!

Our comments start and end with three periods, e.g. ...This is a comment...

Removing comments from string literals was relatively straightforward: match the string literal via /!.*!/ and remove the comment via regex. If there are three or more consecutive periods but no closing periods, throw an error.

However, I want to take this even further. I want to implement the escaping of the exclamation mark within the string literal. Unfortunately, I can't seem to get both comments and exclamation mark escapes working together.

What I want to create are string literals that can contain both comments and exclamation mark escapes. How could this be done?

Examples:

!Normal string!
!String with escaped \! exclamation mark!
!String with a comment ... comment ...!
!String \! with both ... comments can have unescaped exclamation marks!!!... !

This is my current code that can't ignore exclamation marks inside comments:

def t_STRING_LITERAL(t):
    r'![^!\\]*(?:\\.[^!\\]*)*!'
    # remove the escape characters from the string
    t.value = re.sub(r'\\!', "!", t.value)
    # remove single line comments
    t.value = re.sub(r'\.\.\.[^\r\n]*\.\.\.', "", t.value)
    return t
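
To make the failure concrete, here is a minimal standalone check. It assumes the same pattern is used with plain `re` outside the lexer, and the variable names are only illustrative. It runs the pattern on the last example and shows where the match stops.

```python
import re

# The string-literal pattern from the rule above, used standalone.
OLD_LITERAL = re.compile(r'![^!\\]*(?:\\.[^!\\]*)*!')

source = r'!String \! with both ... comments can have unescaped exclamation marks!!!... !'

m = OLD_LITERAL.match(source)
# The match ends at the first unescaped "!" inside the comment,
# so the tail of the comment and the real closing "!" are left over.
print(m.group(0))
print(source[m.end():])
```

The leftover text starts in the middle of the comment, which is why the later comment-removal step never sees a complete ... pair.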
Konsta asked Oct 05 '20 14:10



2 Answers

Perhaps this might be another option.

Match 0+ times any character except a backslash, dot or exclamation mark using the first negated character class.

Then, when you reach a character that the first character class does not match, use an alternation to match either:

  • a dot that is not directly followed by 2 dots
  • or from 3 dots to the nearest next 3 dots
  • or an escaped character

After each alternative, match the negated character class again, and repeat that whole group 0+ times.

To prevent catastrophic backtracking, you can mimic an atomic group in Python using a positive lookahead with a capturing group inside. If the assertion is true, then use the backreference to \1 to match.

For example

(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!

Explanation

  • (?<!\\)! Match ! not directly preceded by \
  • [^!\\.]* Match 0+ times any char except ! \ or .
  • (?: Non capture group
    • (?: Non capture group for the alternation
    • \.(?!\.\.) Match a dot not directly followed by 2 dots
    • | Or
    • (?=(\.{3}.*?\.{3}))\1 Assert and capture in group 1 from ... to the nearest ..., then match it using the backreference \1
    • | Or
    • \\. Match an escaped char
    • ) Close the alternation group
    • [^!\\.]* Match 0+ times any char except ! \ or .
  • )*! Close non capture group and repeat 0+ times, then match !
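
As a rough sketch, the pattern can be tried on the examples from the question with plain `re`. The helper `literal_value` and the clean-up pattern `ESCAPE_OR_COMMENT` are hypothetical names introduced here for illustration and are not part of the original lexer. Inside a lexer that merges all token rules into one large regex, the numbered backreference \1 may need to become a named group, so this example stays standalone.

```python
import re

# Pattern from this answer, unchanged.
LITERAL = re.compile(
    r'(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!'
)

# Hypothetical clean-up pattern: an escaped character or a whole comment.
ESCAPE_OR_COMMENT = re.compile(r'\\(.)|\.{3}.*?\.{3}')


def literal_value(text):
    """Return the literal's value without comments and escapes, or None if no match."""
    if LITERAL.fullmatch(text) is None:
        return None
    body = text[1:-1]  # drop the delimiting exclamation marks
    # Keep the escaped character, drop comments, in one left-to-right pass.
    return ESCAPE_OR_COMMENT.sub(lambda m: m.group(1) or '', body)


examples = [
    r'!Normal string!',
    r'!String with escaped \! exclamation mark!',
    r'!String with a comment ... comment ...!',
    r'!String \! with both ... comments can have unescaped exclamation marks!!!... !',
]
for example in examples:
    print(repr(literal_value(example)))
```

Doing the unescaping and the comment removal in a single pass avoids turning \! into a bare ! before the comments have been located. A literal whose comment is never closed, such as !abc ... def!, is not matched by the pattern at all, so the helper returns None for it.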


The fourth bird answered Oct 18 '22 22:10


Look at this regex to match string literals (https://regex101.com/r/v2bjWi/2):

(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!

  • It is delimited by two (?<!\\)! patterns, each matching an unescaped exclamation mark.
  • Between them, it lazily repeats an alternation of escaped exclamation marks \\!, comments (?:\.\.\.(?P<comment>.*?)\.\.\.) and non-exclamation-mark characters [^!].

Note that this is about as much as you can achieve with a regular expression. Add any further requirement and a regex will no longer be sufficient.
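
A small standalone sketch of how this pattern behaves, assuming plain `re` and made-up input lines (`line` and the two-comment literal are illustrative, not from the question):

```python
import re

# Pattern from this answer, unchanged.
LITERAL_2 = re.compile(
    r'(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!'
)

# Hypothetical input line holding two string literals.
line = '!a ... b!c ...! !d!'

for m in LITERAL_2.finditer(line):
    # Print each literal and the text of its comment (None if it has none).
    print(repr(m.group(0)), repr(m.group('comment')))

# A literal with two comments: a repeated group keeps only its last capture.
two = LITERAL_2.fullmatch('!x ...one... y ...two...!')
print(two.group('comment'))
```

Because the named group sits inside a repeated group, it holds only the last comment of a literal. Collecting every comment would need a second pass over the matched text.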
Alexander Mashin answered Oct 18 '22 22:10