Parsing Newlines, EOF as End-of-Statement Marker with ANTLR3

My question concerns the following grammar, which I am running in ANTLRWorks:

INT :('0'..'9')+;
SEMICOLON: ';';
NEWLINE: ('\r\n'|'\n'|'\r');
STMTEND: (SEMICOLON (NEWLINE)*|NEWLINE+);

statement
    : STMTEND
    | INT STMTEND
    ;

program: statement+;

I get the following results with the following input (with program as the start rule), regardless of which newline NL (CR/LF/CRLF) or integer I choose:

"; NL" or "32; NL" parses without error.
";" or "45;" (without a newline) results in EarlyExitException.
"NL" by itself parses without error.
"456 NL" (without the semicolon) results in MismatchedTokenException.

What I want is for a statement to be terminated by a newline, semicolon, or semicolon followed by newline, and I want the parser to eat as many contiguous newlines as it can on a termination, so "; NL NL NL NL" is just one termination, not four or five. Also, I would like the end-of-file case to be a valid termination as well, but I don't know how to do that yet.

So what's wrong with this grammar, and how can I make it terminate cleanly at EOF? I'm completely new to parsing, ANTLR, and EBNF, and I haven't found much reading material at a level between the simple calculator example and the reference. (I have The Definitive ANTLR Reference, but it really is a reference, and I haven't yet got the quick start at the front to run outside of ANTLRWorks.) Any reading suggestions (besides Wirth's 1977 ACM paper) would be helpful too. Thanks!

chrysanhy asked Apr 17 '11 20:04



1 Answer

For input like ";" or "45;", the lexer never creates a STMTEND token.

";" produces a single token, SEMICOLON, and "45;" produces INT SEMICOLON. Since the parser rule statement only accepts STMTEND as a terminator, neither token stream can be matched.

What you (probably) want is for SEMICOLON and NEWLINE never to become tokens in their own right, so that they only ever appear as part of a STMTEND. You can do that by making them so-called "fragment" rules:

program: statement+;

statement
 : STMTEND
 | INT STMTEND
 ;

INT     : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;

fragment SEMICOLON : ';';
fragment NEWLINE   : '\r' '\n' | '\n' | '\r';

Fragment rules can only be used from other lexer rules: they never produce tokens of their own, so they cannot be referenced in parser (production) rules. To emphasize: with the grammar above, the lexer only ever creates INT or STMTEND tokens.
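
The question also asks about end of file, which the grammar above does not yet cover: an input such as "45" with nothing after it still contains no STMTEND token. One possible way to allow that, sketched here as an untested assumption rather than as part of the answer above, is to let the start rule accept one final unterminated INT before ANTLR's built-in EOF token. The grammar name Stmt is hypothetical; the lexer rules are unchanged.

```
grammar Stmt; // hypothetical name, chosen for illustration only

// Any number of terminated statements, then optionally one last INT
// whose "terminator" is simply the end of the input.
program
 : statement* INT? EOF
 ;

statement
 : STMTEND
 | INT STMTEND
 ;

INT     : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;

fragment SEMICOLON : ';';
fragment NEWLINE   : '\r' '\n' | '\n' | '\r';
```

Ending the start rule with EOF also makes the parser report trailing input it cannot match instead of stopping silently. Note that, unlike statement+, this start rule accepts empty input as well.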

Bart Kiers answered Sep 26 '22 01:09
