What is the simplest (shortest, fewest rules, and no warnings) way to parse both valid dates and numbers in the same grammar? My problem is that a lexer rule to match a valid month (1-12) will match any occurrence of 1-12. So if I just want to match a number, I need a parse rule like:
number: (MONTH|INT);
It only gets more complex when I add lexer rules for day and year. I want a parse rule for date like this:
date: month '/' day ( '/' year )? -> ^('DATE' year month day);
I don't care if month,day & year are parse or lexer rules, just so long as I end up with the same tree structure. I also need to be able to recognize numbers elsewhere, e.g.:
foo: STRING OP number -> ^(OP STRING number);
STRING: ('a'..'z')+;
OP: ('<'|'>');
The problem is that you seem to want to perform both syntactical and semantical checking in your lexer and/or your parser. It's a common mistake, and something that is only possible in very simple languages.
What you really need to do is accept more broadly in the lexer and parser, and then perform semantic checks. How strict you are in your lexing is up to you, but you have two basic options, depending on whether or not you need to accept zeroes preceding your days of the month: 1) Be really accepting for your INTs, 2) define DATENUM to only accept those tokens that are valid days, yet not valid INTs. I recommend the second because there will be less semantic checks necessary later in the code (since INTs will then be verifiable at the syntax level and you'll only need to perform semantic checks on your dates. The first approach:
INT: '0'..'9'+;
The second approach:
DATENUM: '0' '1'..'9';
INT: '0' | SIGN? '1'..'9' '0'..'9'*;
After accepting using these rules in the lexer, your date field would be either:
date: INT '/' INT ( '/' INT )?
or:
date: (INT | DATENUM) '/' (INT | DATENUM) ('/' (INT | DATENUM) )?
After that, you would perform a semantic run over your AST to make sure that your dates are valid.
If you're dead set on performing semantic checks in your grammar, however, ANTLR allows semantic predicates in the parser, so you could make a date field that checks the values like this:
date: month=INT '/' day=INT ( year='/' INT )? { year==null ? (/* First check /*) : (/* Second check */)}
When you do this, however, you are embedding language specific code in your grammar, and it won't be portable across targets.
Using ANTLR4, here is a simple combined grammar that I used. It makes use of the lexer to match simple tokens only, leaving the parser rules to interpret dates vs numbers.
// parser rules
date
: INT SEPARATOR month SEPARATOR INT
| INT SEPARATOR month SEPARATOR INT4
| INT SEPARATOR INT SEPARATOR INT4;
month : JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC ;
number : FLOAT | INT | INT4 ;
// lexer rules
FLOAT : DIGIT+ '.' DIGIT+ ;
INT4 : DIGIT DIGIT DIGIT DIGIT;
INT : DIGIT+;
JAN : [Jj][Aa][Nn] ;
FEB : [Ff][Ee][Bb] ;
MAR : [Mm][Aa][Rr] ;
APR : [Aa][Pp][Rr] ;
MAY : [Mm][Aa][Yy] ;
JUN : [Jj][Uu][Nn] ;
JUL : [Jj][Uu][Ll] ;
AUG : [Aa][Uu][Gg] ;
SEP : [Ss][Ee][Pp] ;
OCT : [Oo][Cc][Tt] ;
NOV : [Nn][Oo][Vv] ;
DEC : [Dd][Ee][Cc] ;
SEPARATOR : [/\\\-] ;
fragment DIGIT : [0-9];
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With