Parsing JavaScript regex with ANTLR

I have an ANTLR JavaScript grammar (taken from the Internet), which seems to support everything except regex literals.

The problem with a regex literal is that you have two rules, essentially:

multiplicativeExpression
    : unaryExpression (LT!* ('*' | '/' | '%')^ LT!* unaryExpression)*

and

regexLiteral
    : '/' RegexLiteralChar* '/'

where the rule RegexLiteralChar uses different lexer rules from a normal expression (e.g. a double quote does not terminate it).

This means that the parser needs some way to change the lexer's state. How can I do this? Is it even possible?

erikkallen asked Aug 31 '12 08:08



1 Answer

The grammar that Bart Kiers mentions in his comment here includes the following note:

The major challenges faced in defining this grammar were:

-1- Ambiguity surrounding the DIV sign in relation to the multiplicative expression and the regular expression literal. This is solved with some lexer driven magic: a gated semantical predicate turns the recognition of regular expressions on or off, based on the value of the RegularExpressionsEnabled property. When regular expressions are enabled they take precedence over division expressions. The decision whether regular expressions are enabled is based on the heuristics that the previous token can be considered as last token of a left-hand-side operand of a division.

...

The areRegularExpressionsEnabled() function is defined as:

// `last` is the previous token the lexer produced (null at the start of input).
// A '/' may start a regex literal unless that token could end a division operand.
private final boolean areRegularExpressionsEnabled()
{
    if (last == null)
    {
        return true;
    }
    switch (last.getType())
    {
        // identifier
        case Identifier:
        // literals
        case NULL:
        case TRUE:
        case FALSE:
        case THIS:
        case OctalIntegerLiteral:
        case DecimalLiteral:
        case HexIntegerLiteral:
        case StringLiteral:
        // member access ending
        case RBRACK:
        // function call or nested expression ending
        case RPAREN:
            return false;
        // otherwise OK
        default:
            return true;
    }
}

The function is then used as a gated semantic predicate in the RegularExpressionLiteral lexer rule:

RegularExpressionLiteral
    : { areRegularExpressionsEnabled() }?=> DIV RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;
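
To see the heuristic in isolation, here is a minimal hand-written sketch in plain Java with no ANTLR dependency. Everything in it (the SlashDisambiguator class, the Kind enum, the Tok type, the describe helper) is hypothetical and only for illustration. It is not the grammar's actual lexer, and it deliberately simplifies: keywords such as return are lexed as identifiers, and strings and regex character classes are not handled. The point is only that the lexer remembers the previous token and consults it when it meets a '/'.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizer (hypothetical): uses the previous token to decide what '/' means. */
final class SlashDisambiguator {

    enum Kind { IDENT, NUMBER, RPAREN, RBRACK, PUNCT, REGEX }

    static final class Tok {
        final Kind kind;
        final String text;
        Tok(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    /** Same idea as areRegularExpressionsEnabled(): false if `last` can end an operand. */
    static boolean regexAllowed(Tok last) {
        if (last == null) {
            return true; // start of input
        }
        switch (last.kind) {
            case IDENT:
            case NUMBER:
            case RPAREN:
            case RBRACK:
                return false; // '/' after an operand is division
            default:
                return true;  // '/' after an operator or '(' starts a regex
        }
    }

    static List<Tok> tokenize(String src) {
        List<Tok> out = new ArrayList<>();
        Tok last = null; // previous non-whitespace token
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            Kind kind;
            if (Character.isLetter(c) || c == '_' || c == '$') {
                while (i < src.length() && (Character.isLetterOrDigit(src.charAt(i))
                        || src.charAt(i) == '_' || src.charAt(i) == '$')) i++;
                kind = Kind.IDENT;
            } else if (Character.isDigit(c)) {
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                kind = Kind.NUMBER;
            } else if (c == '/' && regexAllowed(last)) {
                i++; // opening slash
                while (i < src.length() && src.charAt(i) != '/') {
                    if (src.charAt(i) == '\\') i++; // skip the escaped character
                    i++;
                }
                i++; // closing slash
                while (i < src.length() && Character.isLetter(src.charAt(i))) i++; // flags
                kind = Kind.REGEX;
            } else {
                i++;
                kind = c == ')' ? Kind.RPAREN : c == ']' ? Kind.RBRACK : Kind.PUNCT;
            }
            last = new Tok(kind, src.substring(start, Math.min(i, src.length())));
            out.add(last);
        }
        return out;
    }

    /** Renders the token stream as "KIND:text" pairs separated by spaces. */
    static String describe(String src) {
        List<String> parts = new ArrayList<>();
        for (Tok t : tokenize(src)) parts.add(t.kind + ":" + t.text);
        return String.join(" ", parts);
    }
}
```

With this sketch, SlashDisambiguator.describe("a / b") yields "IDENT:a PUNCT:/ IDENT:b", while SlashDisambiguator.describe("x = /ab+c/g") yields "IDENT:x PUNCT:= REGEX:/ab+c/g". The same character is read differently purely because of the token before it. So the parser never has to change the lexer's state; the lexer makes the decision itself.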
sbridges answered Oct 22 '22 00:10
