
Handling String Literals which End in an Escaped Quote in ANTLR4

Tags: antlr4

How do I write a lexer rule to match a String literal which does not end in an escaped quote?

Here's my grammar:

lexer grammar StringLexer;

// from The Definitive ANTLR 4 Reference
STRING: '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\' ;

Here's my Java code:

String s = "\"\\\""; // looks like "\"
StringLexer lexer = new StringLexer(new ANTLRInputStream(s)); 

Token t = lexer.nextToken();

if (t.getType() == StringLexer.STRING) {
    System.out.println("Saw a String");
}
else {
    System.out.println("Nope");
}

This outputs Saw a String. Should "\" really match STRING?

Edit: Both 280Z28's and Bart's solutions are great; unfortunately, I can only accept one.

hendryau asked Jul 03 '14 15:07


2 Answers

For properly formed input, the lexer will match the text you expect. However, the non-greedy operator does not stop . from consuming the backslash of an escape sequence, so the rule can still match anything of the following form:

'"' .*? '"'

To ensure strings are tokenized in the most "sane" way possible, I recommend using the following rules.

StringLiteral
  : UnterminatedStringLiteral '"'
  ;

UnterminatedStringLiteral
  : '"' (~["\\\r\n] | '\\' (. | EOF))*
  ;

If your language allows string literals to span across multiple lines, you would likely need to modify UnterminatedStringLiteral to allow matching end-of-line characters.

If you do not include the UnterminatedStringLiteral rule, the lexer will handle unterminated strings by simply ignoring the opening " character of the string and proceeding to tokenize the content of the string.
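One rough way to see what these two rules accept is to hand-code the same scan in plain Java. This is only a sketch, not ANTLR-generated code: the names StringLiteralScanner, Kind and classify are made up for illustration, and it only looks at the token that starts at the first character of the input.

```java
// Hand-written approximation of the two lexer rules above (hypothetical names).
class StringLiteralScanner {
    enum Kind { STRING_LITERAL, UNTERMINATED_STRING_LITERAL, NO_MATCH }

    // Mirrors: '"' (~["\\\r\n] | '\\' (. | EOF))*  optionally followed by '"'
    static Kind classify(String input) {
        if (input.isEmpty() || input.charAt(0) != '"') {
            return Kind.NO_MATCH;
        }
        int i = 1;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\') {
                // A backslash also consumes the next character, or stops at end of input.
                i = Math.min(i + 2, input.length());
            } else if (c == '"' || c == '\r' || c == '\n') {
                break; // excluded by the negated set, so the loop ends here
            } else {
                i++;
            }
        }
        // Only a quote right after the scanned body closes the literal.
        boolean closed = i < input.length() && input.charAt(i) == '"';
        return closed ? Kind.STRING_LITERAL : Kind.UNTERMINATED_STRING_LITERAL;
    }

    public static void main(String[] args) {
        System.out.println(classify("\"\\\""));     // input is "\"    -> UNTERMINATED_STRING_LITERAL
        System.out.println(classify("\"a\\\"b\"")); // input is "a\"b" -> STRING_LITERAL
    }
}
```

With this approach the question's input "\" comes out as an unterminated string rather than a complete one, because the backslash always takes the following quote with it.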

Sam Harwell answered Oct 29 '22 14:10


Yes, "\" is matched by the STRING rule:

            STRING: '"' (ESC|.)*? '"';
                     ^       ^     ^
                     |       |     |
// matches:          "       \     "

If you don't want the . to match the backslash (and quote), do something like this:

STRING: '"' ( ESC | ~[\\"] )* '"';

And if your string can't be spread over multiple lines, do:

STRING: '"' ( ESC | ~[\\"\r\n] )* '"';
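If you want to try the difference without generating a lexer, the two rules can be approximated with java.util.regex. This is only an analogue: regex backtracking is not the same mechanism as ANTLR's non-greedy subrules, and the class and method names below are made up for this sketch. On the input from the question it behaves the same way, though.

```java
import java.util.regex.Pattern;

// Regex analogue of the two STRING rules (hypothetical names, not ANTLR code).
class StringRuleRegexDemo {
    // Analogue of: '"' (ESC|.)*? '"'   (DOTALL because ANTLR's . also matches newlines)
    static final Pattern NON_GREEDY =
            Pattern.compile("\"(?:\\\\\"|\\\\\\\\|.)*?\"", Pattern.DOTALL);

    // Analogue of: '"' ( ESC | ~[\\"] )* '"'
    static final Pattern NEGATED_SET =
            Pattern.compile("\"(?:\\\\\"|\\\\\\\\|[^\\\\\"])*\"");

    // True if the whole input is accepted by the non-greedy version.
    static boolean nonGreedyMatches(String input) {
        return NON_GREEDY.matcher(input).matches();
    }

    // True if the whole input is accepted by the negated-set version.
    static boolean negatedSetMatches(String input) {
        return NEGATED_SET.matcher(input).matches();
    }

    public static void main(String[] args) {
        String s = "\"\\\""; // looks like "\"
        System.out.println(nonGreedyMatches(s));  // true: . swallows the backslash
        System.out.println(negatedSetMatches(s)); // false: a backslash is only allowed inside ESC
    }
}
```

Both patterns still accept a well-formed string such as "a\"b"; only the input ending in an escaped quote is treated differently.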

Bart Kiers answered Oct 29 '22 13:10