Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I match unicode characters in antlr

I am trying to pick out all tokens in a text and need to match all Ascii and Unicode characters, so here is how I have laid them out.

fragment CHAR     :  ('A'..'Z') | ('a'..'z');
fragment DIGIT    :  ('0'..'9');
fragment UNICODE  :  '\u0000'..'\u00FF';

Now if I write my token rule as:

TOKEN  :  (CHAR|DIGIT|UNICODE)+;

I get "Decision can match input such as "'A'..'Z'" using multiple alternatives: 1, 3 As a result, alternative(s) 3 were disabled for that input" " Decision can match input such as "'0'..'9'" using multiple alternatives: 2, 3 As a result, alternative(s) 3 were disabled for that input"

And nothing gets matched: And also if I write it as

TOKEN  :  (UNICODE)+;

Nothing gets matched.

Is there a way of doing this.

like image 635
Lezan Avatar asked Jan 17 '10 17:01

Lezan


People also ask

How many characters are in a Unicode string?

As of Unicode version 15.0, there are 149,186 characters with code points, covering 161 modern and historical scripts, as well as multiple symbol sets.

How do I connect to Unicode?

Unicode characters can then be entered by holding down Alt , and typing + on the numeric keypad, followed by the hexadecimal code – using the numeric keypad for digits from 0 to 9 and letter keys for A to F – and then releasing Alt .

What is a Unicode character code?

The Unicode character encoding standard is a fixed-length, character encoding scheme that includes characters from almost all of the living languages of the world. Information about Unicode can be found in The Unicode Standard , and from the Unicode Consortium website at www.unicode.org.


2 Answers

One other thing to consider if you are planning on using Unicode is that you should set the charvocabulary option to say that you want to allow any char in the Unicode range of 0 through FFFE

options
{
charVocabulary='\u0000'..'\uFFFE';
}

The default you'll usually see in the examples is

options
{
charVocabulary = '\3'..'\377';
}

To cover the point made above. Generally if you needed both the ascii character range 'A'..'Z' and the unicode range you'd make a unicode lexer rule like: '\u0080'..'\ufffe'

like image 176
chollida Avatar answered Sep 25 '22 19:09

chollida


Practically speaking, TOKEN: (UNICODE)+ is completely useless.

Since everything is a token character, if you try to use such a rule to match a Java program, say, it will simply match the whole program and return it to you as one big token.

You really do need to break your characters down into different groups if you want to split your input apart into meaningful fragments.

It might help you to take a look at how the "pros" have done it. Here is a BNF grammar for Java, and here is BNF for an identifier, which shows how they took to the trouble to group out

identifier 
  ::= "a..z,$,_" { "a..z,$,_,0..9,unicode character over 00C0" } 
like image 23
Carl Smotricz Avatar answered Sep 25 '22 19:09

Carl Smotricz