Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ANTLR4 Lexer error reporting (length of offending characters)

I'm developing a small IDE for some language using ANTLR4 and need to underline erroneous characters when the lexer fails to match them. The built in org.antlr.v4.runtime.ANTLRErrorListener implementation outputs a message to stderr in such cases, similar to this:

line 35:25 token recognition error at: 'foo\n'

I have no problem understanding how information about line and column of the error is obtained (passed as arguments to syntaxError callback), but how do I get the 'foo\n' string inside the callback?

When a parser is the source of the error, it passes the offending token as the second argument of syntaxError callback, so it becomes trivial to extract information about the start and stop offsets of the erroneous input and this is also explained in the reference book. But what about the case when the source is a lexer? The second argument in the callback is null in this case, presumably since the lexer failed to form a token.

I need the length of unmatched characters to know how much to underline, but while debugging my listener implementation I could not find this information anywhere in the supplied callback arguments (other than extracting it from the supplied error message though string manipulation, which would be just wrong). The 'foo\n' string may clearly be obtained somehow, so what am I missing?

I suspect that I might be looking in the wrong place and that I should be looking at extending DefaultErrorStrategy where error messages get formed.

like image 209
predi Avatar asked Dec 03 '22 21:12

predi


1 Answers

You should write your lexer such that a syntax error is impossible. In ANTLR 4, it is easy to do this by simply adding the following as the last rule of your lexer:

ErrorChar : . ;

By doing this, your errors are moved from the lexer to the parser.

In some cases, you can take additional steps to help users while they edit code in your IDE. For example, suppose your language supports double-quoted strings of the following form, which cannot span multiple lines:

StringLiteral : '"' ~[\r\n"]* '"';

You can improve error reporting in your IDE by using the following pair of rules:

StringLiteral : '"' ~[\r\n"]* '"';
UnterminatedStringLiteral : '"' ~[\r\n"]*;

You can then override the emit() method to treat the UnterminatedStringLiteral in a special way. As a result, the user sees a great error message and the parser sees a single StringLiteral token that it can generally handle well.

@Override
public Token emit() {
    switch (getType()) {
    case UnterminatedStringLiteral:
        setType(StringLiteral);
        Token result = super.emit();
        // you'll need to define this method
        reportError(result, "Unterminated string literal");
        return result;
    default:
        return super.emit();
    }
}
like image 198
Sam Harwell Avatar answered Feb 23 '23 04:02

Sam Harwell