How do I match unicode characters in antlr

Tags:

I am trying to pick out all tokens in a text and need to match all Ascii and Unicode characters, so here is how I have laid them out.

fragment CHAR     :  ('A'..'Z') | ('a'..'z');
fragment DIGIT    :  ('0'..'9');
fragment UNICODE  :  '\u0000'..'\u00FF';

Now if I write my token rule as:

TOKEN  :  (CHAR|DIGIT|UNICODE)+;

I get "Decision can match input such as "'A'..'Z'" using multiple alternatives: 1, 3 As a result, alternative(s) 3 were disabled for that input" " Decision can match input such as "'0'..'9'" using multiple alternatives: 2, 3 As a result, alternative(s) 3 were disabled for that input"

And nothing gets matched: And also if I write it as

TOKEN  :  (UNICODE)+;

Nothing gets matched.

Is there a way of doing this.

635

asked Jan 17 '10 17:01

Lezan

2 Answers

One other thing to consider if you are planning on using Unicode is that you should set the charvocabulary option to say that you want to allow any char in the Unicode range of 0 through FFFE

options
{
charVocabulary='\u0000'..'\uFFFE';
}

The default you'll usually see in the examples is

options
{
charVocabulary = '\3'..'\377';
}

To cover the point made above. Generally if you needed both the ascii character range 'A'..'Z' and the unicode range you'd make a unicode lexer rule like: '\u0080'..'\ufffe'

176

answered Sep 25 '22 19:09

chollida

Practically speaking, TOKEN: (UNICODE)+ is completely useless.

Since everything is a token character, if you try to use such a rule to match a Java program, say, it will simply match the whole program and return it to you as one big token.

You really do need to break your characters down into different groups if you want to split your input apart into meaningful fragments.

It might help you to take a look at how the "pros" have done it. Here is a BNF grammar for Java, and here is BNF for an identifier, which shows how they took to the trouble to group out

identifier 
  ::= "a..z,$,_" { "a..z,$,_,0..9,unicode character over 00C0" }

answered Sep 25 '22 19:09

Carl Smotricz

Related questions
                            
                                Difficulties inherent in ASCII and Extended ASCII, and Unicode Compatibility?
                            
                                How to insert Indian Rupees Symbol in Database (Oracle 10g, MySql 5.0 and Sql Server 2008)?
                            
                                Unicode (hexadecimal) character literals in MySQL
                            
                                What is DOMDocument doing to my string?
                            
                                Python String encode method
                            
                                How to get support for '✖' and the like in the Emacs shell buffer?
                            
                                What are standard unicode fonts?
                            
                                UTF-16BE to UTF-16LE, and back
                            
                                Generating a random string
                            
                                Widestring to string conversion in Delphi 7
                            
                                TypeError: unsupported operand type(s) for -: 'unicode' and 'unicode', coords
                            
                                Can we switch between ASCII and Unicode
                            
                                Does python support unicode beyond basic multilingual plane?
                            
                                QString to unicode std::string
                            
                                Python string splitlines() removes certain Unicode control characters
                            
                                Swift use unicode character in localization.strings
                            
                                CSS/HTML: Code for double down/up pointing angle
                            
                                bash - Remove all Unicode Spaces and replace with Normal Space
                            
                                Python - replace unicode emojis with ASCII characters
                            
                                How to find out if there is any non ASCII character in a string with a file path

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How do I match unicode characters in antlr

Tags:

unicode

antlr

antlr3

Lezan

People also ask

2 Answers

chollida

Carl Smotricz

Recent Activity

Donate For Us