Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Syntactic predicates in ANTLR lexer rules

Introduction

Looking at the documentation, ANTLR 2 used to have something called predicated lexing, with examples like this one (inspired by Pascal):

RANGE_OR_INT
    :   ( INT ".." ) => INT  { $setType(INT); }
    |   ( INT '.' )  => REAL { $setType(REAL); }
    |   INT                  { $setType(INT); }
    ;    

The way I see it, that's essentially a positive look-ahead assertion at the beginning of the rule: If the look-ahead matches INT ".." then the first rule will apply (and match the INT portion of that input), and so on.

I haven't found something like this in ANTLR 4 yet. The 2 to 3 migration guide doesn't seem to mention this, while the 3 to 4 changes document states:

The biggest difference between ANTLR 3 and 4 is that ANTLR 4 takes any grammar you give it unless the grammar had indirect left recursion. That means we don't need syntactic predicates or backtracking so ANTLR 4 does not support that syntax; you will get a warning for using it.

This is in line with the error message I get if I leave this essentially as is:

(...)=> syntactic predicates are not supported in ANTLR 4

While I can understand how a more intelligent parser implementation would resolve these ambiguities, I fail to see how this would work for lexers.

Reproducing example

To be sure, let's try this out:

grammar Demo;
prog:   atom (',' atom)* ;
atom:   INT  { System.out.println("INT:   " + $INT.getText()); }
    |   REAL { System.out.println("REAL:  " + $REAL.getText()); }
    |   a=INT RANGE b=INT { System.out.println("RANGE: " +
                              $a.getText() + " .. " + $b.getText()); }
    ;
WS  :   (' ' | '\t' | '\n' | '\r')+ -> skip ;
INT :   ('0'..'9')+ ;
REAL:   INT '.' INT? | '.' INT ;
RANGE:  '..' ;

Save this to Demo.g, then compile and run:

$ wget -nc http://www.antlr.org/download/antlr-4.5.2-complete.jar
$ java -jar antlr-4.5.2-complete.jar Demo.g
$ javac -cp antlr-4.5.2-complete.jar Demo*.java
$ java -cp .:antlr-4.5.2-complete.jar org.antlr.v4.gui.TestRig \
  Demo prog <<< '1,2.,3.4,5 ..6,7..8'
INT:   1
REAL:  2.
REAL:  3.4
RANGE: 5 .. 6
REAL:  7.
line 1:17 extraneous input '.8' expecting {<EOF>, ','}

So it seems I was correct: while the removal of syntactic predecates might have been appropriate for the parser, the lexer won't suddenly guess the right token type.

Core Question

So how would one convert this specific example to ANTLR 4? Is there a way to express look-ahead conditions? Or perhaps a way to have a single rule like INT '..' emit two distinct tokens?

References and possible solutions

Looking at the ANTLR 4 Pascal grammar, I notice that it doesn't allow real numbers to end in . without a digit after that, so learning a solution from there doesn't appear to be an option.

I've seen Semantic predicates in ANTLR4? and syntactic predicates - Upgrading from Antlr 3 to Antlr 4. Both discuss syntactic predicates in parser rules. The latter also has an example with lexer rules, but the look-ahead is identical to the rule that follows it, which means the rules could get removed without adverse effects. This is not the case in my example above.

Answers to check previous/left token in lexer mention the emit method of the lexer, with a comment referencing the How can I emit more than a single token per lexer rule? FAQ page in the ANTLR 3 wiki, so I guess that's one approach. I will turn this into an answer if noone beats me to it and if I can get it to work in my example.

The answer to ANTLR4 negative lookahead in lexer makes use of the _input.LA(int) method to examine the look-ahead. The ANTLR 4 lexical analysis faq mentions _input.LA without going into details. This should work for the example above as well, but would be hard for scenarios where there is more than a single character of look-ahead to consider.

like image 704
MvG Avatar asked Mar 01 '16 13:03

MvG


1 Answers

Here is a very short solution:

@lexer::members { private int _pos; }
INT_RANGE: INT  { _pos=_input.index(); setType(INT); emit(); }
           '..' { _input.seek(_pos); };

This matches the whole INT '..' expression, but then rewinds the input to just after the INT where we emit the token and save the position. That position is then used at the end of the rule to rewind the input in a more permanent manner.

There is, however, a problem: the resulting tokens will have incorrect position information since the _input.seek won't affect what getCharPositionInLine returns. In this case one could do

setCharPositionInLine(getCharPositionInLine() - 2)

at the end of the rule, but that approach would not work if instead of .. one were dealing with input of variable length. I had hoped that I would be able to save the result of getCharPositionInLine() in the first action, but unfortunately that already reflects the end of the whole expression.

Looking at LexerATNSimulator.evaluatePredicate I see that this method makes an effort to restore a given position state. So we can get at the correct state by abusing a semantic predicate for its side effects:

@lexer::members {
    private int _savedIndex, _savedLine, _savedColumn;
    private boolean remember() {
        _savedIndex = _input.index();
        _savedLine = getLine();
        _savedColumn = getCharPositionInLine();
        return true;
    }
    private void recall(int type) {
        _input.seek(_savedIndex);
        setLine(_savedLine);
        setCharPositionInLine(_savedColumn);
        setType(type);
    }
}
INT_RANGE: INT { remember() }? '..' { recall(INT); } ;

Keep in mind that the semantic predicate will get executed at a point in time where it isn't yet guaranteed that the whole expression will actually match. So if you use this trick in several places, you have to be careful that you don't get remember() calls from different rules overwriting the state. If in doubt, you could use multiple such functions, or an index into an array, to make each match unambiguous.

like image 136
MvG Avatar answered Sep 20 '22 12:09

MvG