I am looking at the ANSI C grammar.
That page includes many Lex/Flex regular expressions for ANSI C.
I am having trouble understanding the regular expression for string literals.
The regular expression they give is \"(\\.|[^\\"])*\"
As I understand it, \" stands for a double quote, \\ is the escape character, . is any character except the escape character, and * means zero or more times.
[^\\"] means any character except \ and ".
So, in my opinion, the regular expression should be \"(\\.)*\".
Can you give some strings for which my regular expression fails?
or
Why have they used [^\\"]?
A "string literal" is a sequence of characters from the source character set enclosed in double quotation marks (" "). String literals are used to represent a sequence of characters which, taken together, form a null-terminated string.
The [] construct in a regex is essentially shorthand for joining each of its contents with |. For example, [abc] matches a, b or c. The - character has a special meaning inside [] : it provides a range construct, so the regex [a-z] matches any letter from a through z. A ^ as the first character inside [] negates the set, so [^\\"] matches any single character other than \ and ".
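The bracket behaviour described above can be checked quickly outside of Flex. The following is only a sketch using Python's re module rather than Lex itself; the helper name matches is made up for this illustration, and character classes behave the same way in both tools for these simple cases.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# [abc] accepts the same single characters as a|b|c.
print(matches(r"[abc]", "b"), matches(r"a|b|c", "b"))   # both True
print(matches(r"[abc]", "d"))                           # False

# Inside [], a-z is a range.
print(matches(r"[a-z]", "q"))                           # True
print(matches(r"[a-z]", "Q"))                           # False

# A leading ^ negates the set: anything except a backslash or a double quote.
print(matches(r'[^\\"]', "x"))                          # True
print(matches(r'[^\\"]', "\\"))                         # False (backslash)
print(matches(r'[^\\"]', '"'))                          # False (double quote)
```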
The regex \"(\\.)*\" that you proposed matches only strings whose contents are a sequence of pairs, each pair being a \ followed by any single character, such as:
"\z\x\p\r"
This regular expression would therefore fail to match a string like:
"hello"
The string "hello" would be matched by the regex \".*\", but that regex would also match """" and "\", both of which are invalid.
To get rid of these invalid matches we can use \"[^\\"]*\", but this will now fail to match a string like "\a\a\a" which is a valid string.
As we saw \"(\\.)*\" does match this string, so all we need to do is combine these two to get \"(\\.|[^\\"])*\".
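To see the four patterns side by side, here is a small sketch in Python's re module instead of Flex. The constant names, the accepts helper and the sample strings are assumptions made for this illustration. Python does not need the \" escaping that Lex uses for a literal quote, so the patterns are written with a plain ". Note also that re.fullmatch asks whether the entire text is one string literal, whereas a Flex scanner takes the longest match at the current position, so this only approximates scanner behaviour.

```python
import re

# The four patterns from the discussion, translated from Lex to Python syntax.
PROPOSED = r'"(\\.)*"'           # only backslash + character pairs
GREEDY = r'".*"'                 # anything between the quotes
NO_ESCAPES = r'"[^\\"]*"'        # no backslashes or quotes inside
COMBINED = r'"(\\.|[^\\"])*"'    # escape pairs OR ordinary characters


def accepts(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if text as a whole matches pattern."""
    return re.fullmatch(pattern, text) is not None


samples = [
    '"\\z\\x\\p\\r"',   # "\z\x\p\r"  escape pairs only
    '"hello"',          # ordinary characters only
    '""""',             # invalid: stray quotes inside
    '"\\"',             # "\"  invalid: the closing quote is escaped
    '"\\a\\a\\a"',      # "\a\a\a"  valid
    '"say \\"hi\\""',   # "say \"hi\""  valid mix of both kinds
]

patterns = [("PROPOSED", PROPOSED), ("GREEDY", GREEDY),
            ("NO_ESCAPES", NO_ESCAPES), ("COMBINED", COMBINED)]

for text in samples:
    verdicts = ", ".join(f"{name}={accepts(p, text)}" for name, p in patterns)
    print(f"{text:16} {verdicts}")
```

Only COMBINED accepts every valid sample while rejecting """" and "\", which is why the grammar needs both alternatives inside the group.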