
ANTLR4 grammar token recognition error after import

I am using a parser grammar and a lexer grammar for ANTLR4 from GitHub to parse PHP in Python 3.

When I use these grammars directly, my PoC code works:

antlr-test.py

from antlr4 import *
# from PHPParentLexer import PHPParentLexer
# from PHPParentParser import PHPParentParser
# from PHPParentParser import PHPParentListener

from PHPLexer import PHPLexer as PHPParentLexer
from PHPParser import PHPParser as PHPParentParser
from PHPParser import PHPParserListener as PHPParentListener


class PhpGrammarListener(PHPParentListener):
    def enterFunctionInvocation(self, ctx):
        print("enterFunctionInvocation " + ctx.getText())


if __name__ == "__main__":
    scanner_input = FileStream('test.php')
    lexer = PHPParentLexer(scanner_input)
    stream = CommonTokenStream(lexer)
    parser = PHPParentParser(stream)
    tree = parser.htmlDocument()
    walker = ParseTreeWalker()
    printer = PhpGrammarListener()
    walker.walk(printer, tree)

which gives the output

/opt/local/bin/python3.4 /Users/d/PycharmProjects/name/antlr-test.py
enterFunctionInvocation echo("hi") 
enterFunctionInvocation another_method("String")
enterFunctionInvocation print("print statement")

Process finished with exit code 0

When I use the following PHPParent.g4 grammar, I get a lot of errors:

grammar PHPParent;
options { tokenVocab=PHPLexer; }
import PHPParser;

After swapping the comments on the Python imports, I get these errors:

/opt/local/bin/python3.4 /Users/d/PycharmProjects/name/antlr-test.py
line 1:1 token recognition error at: '?'
line 1:2 token recognition error at: 'p'
line 1:3 token recognition error at: 'h'
line 1:4 token recognition error at: 'p'
line 1:5 token recognition error at: '\n'
...
line 2:8 no viable alternative at input '<('
line 2:14 mismatched input ';' expecting {<EOF>, '<', '{', '}', ')', '?>', 'list', 'global', 'continue', 'return', 'class', 'do', 'switch', 'function', 'break', 'if', 'for', 'foreach', 'while', 'new', 'clone', '&', '!', '-', '~', '@', '$', <INVALID>, 'Interface', 'abstract', 'static', Array, RequireOperator, DecimalNumber, HexNumber, OctalNumber, Float, Boolean, SingleQuotedString, DoubleQuotedString_Start, Identifier, IncrementOperator}
line 3:28 mismatched input ';' expecting {<EOF>, '<', '{', '}', ')', '?>', 'list', 'global', 'continue', 'return', 'class', 'do', 'switch', 'function', 'break', 'if', 'for', 'foreach', 'while', 'new', 'clone', '&', '!', '-', '~', '@', '$', <INVALID>, 'Interface', 'abstract', 'static', Array, RequireOperator, DecimalNumber, HexNumber, OctalNumber, Float, Boolean, SingleQuotedString, DoubleQuotedString_Start, Identifier, IncrementOperator}
line 4:28 mismatched input ';' expecting {<EOF>, '<', '{', '}', ')', '?>', 'list', 'global', 'continue', 'return', 'class', 'do', 'switch', 'function', 'break', 'if', 'for', 'foreach', 'while', 'new', 'clone', '&', '!', '-', '~', '@', '$', <INVALID>, 'Interface', 'abstract', 'static', Array, RequireOperator, DecimalNumber, HexNumber, OctalNumber, Float, Boolean, SingleQuotedString, DoubleQuotedString_Start, Identifier, IncrementOperator}

However, I get no errors when running the antlr4 tool over the grammars, only warnings. I'm stumped here. What could be causing this issue?

$ a4p PHPLexer.g4
warning(146): PHPLexer.g4:363:0: non-fragment lexer rule DoubleQuotedStringBody can match the empty string
$ a4p PHPParser.g4
warning(154): PHPParser.g4:523:0: rule doubleQuotedString contains an optional block with at least one alternative that can match an empty string
$ a4p PHPParent.g4
warning(154): PHPParent.g4:523:0: rule doubleQuotedString contains an optional block with at least one alternative that can match an empty string

asked Apr 14 '15 14:04 by Diarmaid


1 Answer

Import in ANTLR4 is kind of messy.

First, tokenVocab does not generate the lexer you need. It only tells ANTLR that this grammar uses the token types of PHPLexer, which are read from PHPLexer.tokens. If you delete PHPLexer.tokens, the grammar won't even compile!
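To see how little a .tokens file carries, it can help to read one directly. The following is a minimal sketch, assuming the usual line format ANTLR4 writes to .tokens files (NAME=type or 'literal'=type). The helper name parse_tokens_text and the sample content are hypothetical and are not taken from the real PHPLexer.tokens.

```python
def parse_tokens_text(text):
    """Map token names (or quoted literals) to integer token types."""
    vocab = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '=' so that a literal such as '=' survives intact.
        name, _, number = line.rpartition("=")
        vocab[name] = int(number)
    return vocab


# Hypothetical sample in the style of a .tokens file (not the real PHPLexer.tokens).
sample = "PHPStart=1\nIdentifier=2\n'='=3\n"
vocab = parse_tokens_text(sample)
print(vocab)
```

The file holds only names and numbers, with no lexer rules, which is consistent with the point above: referencing it through tokenVocab gives a grammar the token types but not a working lexer.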

Take a look at PHPParser.g4, which also uses options { tokenVocab=PHPLexer; }. Yet the Python script still has to use the lexer from PHPLexer to make it work. The generated PHPParentLexer, by contrast, is not usable at all. That's why you got all those errors.

To generate a new lexer from a combined grammar, you need to import the lexer grammar like this:

grammar PHPParent;
import PHPLexer;

However, lexer modes are not supported when importing, and PHPLexer itself relies heavily on modes, so that is not an option either.

Can we simply replace PHPParentLexer with PHPLexer? Sadly, no. Because PHPParentParser is generated together with PHPParentLexer, they are tightly coupled and cannot be used separately. If you use PHPLexer, PHPParentParser won't work properly either. For this particular grammar it does actually work, thanks to error recovery, but it still reports some errors.
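The coupling comes down to token type numbers: a parser can only interpret a lexer's output if both sides assign the same integer to each token name. A rough way to check this is to compare the two vocabularies once they have been loaded from the respective .tokens files. This is only a sketch: the function name vocab_mismatches and the sample dictionaries are hypothetical and do not reflect the real PHPLexer or PHPParent numbering.

```python
def vocab_mismatches(lexer_vocab, parser_vocab):
    """Return {name: (lexer_type, parser_type)} for tokens that disagree.

    A type of None means the name is missing on that side.
    """
    mismatches = {}
    for name in sorted(set(lexer_vocab) | set(parser_vocab)):
        lexer_type = lexer_vocab.get(name)
        parser_type = parser_vocab.get(name)
        if lexer_type != parser_type:
            mismatches[name] = (lexer_type, parser_type)
    return mismatches


# Hypothetical token tables, for illustration only.
lexer_vocab = {"PHPStart": 1, "Identifier": 2, "Echo": 3}
parser_vocab = {"PHPStart": 1, "Identifier": 3, "Semi": 4}
result = vocab_mismatches(lexer_vocab, parser_vocab)
print(result)
```

An empty result would mean the two sides agree on numbering. It says nothing about whether the lexer actually has rules that produce those tokens.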

There seems to be no better way than to rewrite part of the grammar. The import feature of ANTLR4 definitely has some design issues.

answered Sep 22 '22 21:09 by skyline75489