Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to write a regular expression to match a string literal where the escape is a doubling of the quote character?

I am writing a parser using ply that needs to identify FORTRAN string literals. These are quoted with single quotes with the escape character being doubled single quotes. i.e.

'I don''t understand what you mean'

is a valid escaped FORTRAN string.

Ply takes input in regular expression. My attempt so far does not work and I don't understand why.

t_STRING_LITERAL = r"'[^('')]*'"

Any ideas?

like image 649
Brendan Avatar asked Jan 26 '10 22:01

Brendan


People also ask

How do you match double quotes in regex?

Firstly, double quote character is nothing special in regex - it's just another character, so it doesn't need escaping from the perspective of regex. However, because Java uses double quotes to delimit String constants, if you want to create a string in Java with a double quote in it, you must escape them.

What does ?= Mean in regular expression?

?= is a positive lookahead, a type of zero-width assertion. What it's saying is that the captured match must be followed by whatever is within the parentheses but that part isn't captured. Your example means the match needs to be followed by zero or more characters and then a digit (but again that part isn't captured).

How do you include a quote in regex?

Try putting a backslash ( \ ) followed by " .


2 Answers

A string literal is:

  1. An open single-quote, followed by:
  2. Any number of doubled-single-quotes and non-single-quotes, then
  3. A close single quote.

Thus, our regex is:

r"'(''|[^'])*'"
like image 180
Anon. Avatar answered Sep 18 '22 15:09

Anon.


You want something like this:

r"'([^']|'')*'"

This says that inside of the single quotes you can have either double quotes or a non-quote character.

The brackets define a character class, in which you list the characters that may or may not match. It doesn't allow anything more complicated than that, so trying to use parentheses and match a multiple-character sequence ('') doesn't work. Instead your [^('')] character class is equivalent to [^'()], i.e. it matches anything that's not a single quote or a left or right parenthesis.

like image 29
John Kugelman Avatar answered Sep 19 '22 15:09

John Kugelman