ANTLR4 doesn't correctly recognize Unicode characters

Question

I have a very simple grammar that tries to match 'é' to the token E_CODE. I tested it with the TestRig tool (using the -tokens option), but the character isn't matched correctly. My input file is encoded in UTF-8 without a BOM, and I'm using ANTLR version 4.4. Could somebody else check this as well? I get this output on my console:
line 1:0 token recognition error at: 'Ă'

grammar Unicode;

stat:EOF;  
E_CODE: '\u00E9' | 'é';

Bart Kiers · Accepted Answer

I tested the grammar:

grammar Unicode;

stat: E_CODE* EOF;

E_CODE: '\u00E9' | 'é';

as follows:

UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());

and the following got printed to my console:

éé<EOF>

Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).
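Note that the test above hands a Java string directly to the lexer, so no file decoding is involved. To check the file-based case, one option is to decode the file explicitly as UTF-8 before lexing. The following is only a sketch that uses the JDK alone; the class name Utf8Source, the helper readUtf8, and the temporary file are made up for illustration and are not part of ANTLR:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Source {
    // Decode the file's bytes as UTF-8, ignoring the platform default encoding.
    public static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical temp file standing in for the asker's input file.
        Path file = Files.createTempFile("unicode-input", ".txt");
        Files.write(file, "\u00E9".getBytes(StandardCharsets.UTF_8));

        String text = readUtf8(file);
        // One char, code point 233 (U+00E9): prints "1 233".
        System.out.println(text.length() + " " + (int) text.charAt(0));

        Files.delete(file);
    }
}
```

The resulting string could then be passed to new ANTLRInputStream(text), exactly as in the test above.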

EDIT

Looking at the source, I see that TestRig takes an optional -encoding parameter. Have you tried setting it?
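A plausible explanation for the 'Ă' in the error message is that the UTF-8 bytes of 'é' (0xC3 0xA9) were decoded with a single-byte platform default encoding. The sketch below reproduces that effect, assuming the default was windows-1250; that charset is a guess based on the error output, not something stated in the question, and MojibakeDemo and misdecode are illustrative names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with a different
    // charset, mimicking a reader that assumes the wrong file encoding.
    public static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        // Assumption: windows-1250 stands in for the asker's platform default.
        Charset assumedDefault = Charset.forName("windows-1250");
        String seen = misdecode("\u00E9", assumedDefault);
        for (char c : seen.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        // Prints U+0102 (the letter A with breve) and then U+00A9 (copyright sign).
    }
}
```

Under that assumption the lexer receives two characters, neither of which is U+00E9, so E_CODE cannot match. Passing -encoding UTF-8 on the TestRig command line should make it decode the file as intended.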

Tags: antlr4

Adrian
