On page 74 of the ANTLR4 book it says that any Unicode character can be used in a grammar simply by specifying its codepoint in this manner:
'\uxxxx'
where xxxx is the hexadecimal value of the Unicode codepoint.
So I used that technique in a token rule for an ID token:
grammar ID;
id : ID EOF ;
ID : ('a' .. 'z' | 'A' .. 'Z' | '\u0100' .. '\u017E')+ ;
WS : [ \t\r\n]+ -> skip ;
When I tried to parse this input:
Gŭnter
ANTLR throws an error, saying that it does not recognize ŭ. (The ŭ character is hex 016D, so it is within the specified range.)
What am I doing wrong, please?
For those having the same problem using ANTLR4 in Java code, with ANTLRInputStream being deprecated, here is a working way to pass Unicode data from a String to the MyLexer lexer:
import java.nio.CharBuffer;
import org.antlr.v4.runtime.CodePointBuffer;
import org.antlr.v4.runtime.CodePointCharStream;
import org.antlr.v4.runtime.CommonTokenStream;

String myString = "\u2013";
CharBuffer charBuffer = CharBuffer.wrap(myString.toCharArray());
CodePointBuffer codePointBuffer = CodePointBuffer.withChars(charBuffer);
CodePointCharStream cpcs = CodePointCharStream.fromBuffer(codePointBuffer);
MyLexer lexer = new MyLexer(cpcs);
CommonTokenStream tokens = new CommonTokenStream(lexer);
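If you are on ANTLR 4.7 or later, the same result should be achievable in one call through the CharStreams factory, which builds a code-point stream for you. A minimal sketch (MyLexer again stands in for your generated lexer):

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

String myString = "\u2013";
// fromString builds a code-point stream over the string's contents
CharStream cpcs = CharStreams.fromString(myString);
MyLexer lexer = new MyLexer(cpcs);
CommonTokenStream tokens = new CommonTokenStream(lexer);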
ANTLR is ready to accept 16-bit characters but, by default, many locales will read in characters as bytes (8 bits). You need to specify the appropriate encoding when you read from the file using the Java libraries. If you are using the TestRig, perhaps through the alias/script grun, then use the argument -encoding utf-8 or whatever encoding applies. If you look at the source code of that class, you will see the following mechanism:
InputStream is = new FileInputStream(inputFile);
Reader r = new InputStreamReader(is, encoding); // e.g., euc-jp or utf-8
ANTLRInputStream input = new ANTLRInputStream(r);
XLexer lexer = new XLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
...
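If you are driving the lexer yourself on a newer runtime, where ANTLRInputStream is deprecated, the CharStreams factory accepts an explicit charset when reading a file. A rough equivalent of the snippet above, assuming ANTLR 4.7+ and the same XLexer placeholder:

import java.nio.charset.StandardCharsets;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

// decode the file as UTF-8 (or whatever encoding the input actually uses)
CharStream input = CharStreams.fromFileName(inputFile, StandardCharsets.UTF_8);
XLexer lexer = new XLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);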
Grammar:
NAME:
   [A-Za-z][0-9A-Za-z\u0080-\uFFFF_]+
   ;
Java:
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenStream;
import com.thalesgroup.dms.stimulus.StimulusParser.SystemContext;
final class RequirementParser {

   static SystemContext parse( String requirement ) {
      requirement = requirement.replaceAll( "\t", " " );
      final CharStream charStream = CharStreams.fromString( requirement );
      final StimulusLexer lexer = new StimulusLexer( charStream );
      final TokenStream tokens = new CommonTokenStream( lexer );
      final StimulusParser parser = new StimulusParser( tokens );
      final SystemContext system = parser.system();
      if( parser.getNumberOfSyntaxErrors() > 0 ) {
         Debug.format( requirement );
      }
      return system;
   }

   private RequirementParser() {/**/}
}
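For completeness, a call site could look like the line below; the requirement string is just an illustrative value:

SystemContext system = RequirementParser.parse( "Gŭnter likes coffee" );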
Source:
Lexers and Unicode text