I am trying to pick out all the tokens in a text and need to match all ASCII and Unicode characters, so here is how I have laid out the rules:
fragment CHAR : ('A'..'Z') | ('a'..'z');
fragment DIGIT : ('0'..'9');
fragment UNICODE : '\u0000'..'\u00FF';
Now if I write my token rule as:
TOKEN : (CHAR|DIGIT|UNICODE)+;
I get these warnings:
"Decision can match input such as "'A'..'Z'" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input"
"Decision can match input such as "'0'..'9'" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input"
And nothing gets matched. Also, if I write it as
TOKEN : (UNICODE)+;
nothing gets matched either.
Is there a way of doing this?
One other thing to consider if you are planning on using Unicode: you should set the charVocabulary option to say that you want to allow any character in the Unicode range 0 through FFFE:
options
{
charVocabulary='\u0000'..'\uFFFE';
}
The default you'll usually see in the examples is
options
{
charVocabulary = '\3'..'\377';
}
To cover the point made above: generally, if you need both the ASCII character range 'A'..'Z' and the Unicode range, you'd make a Unicode lexer rule that starts above ASCII, so the two ranges do not overlap, like:
'\u0080'..'\ufffe'
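As a rough sketch of that suggestion applied to the rules from the question (keeping the question's rule names and its fragment syntax, which are the only assumptions here), the UNICODE range can start at '\u0080' so that it no longer shares any characters with CHAR or DIGIT:

```antlr
// ASCII letters and digits, unchanged from the question
fragment CHAR    : ('A'..'Z') | ('a'..'z');
fragment DIGIT   : ('0'..'9');
// Starts above the ASCII range, so it cannot match anything CHAR or DIGIT matches
fragment UNICODE : '\u0080'..'\uFFFE';

TOKEN : (CHAR | DIGIT | UNICODE)+;
```

Because the three ranges are now disjoint, each input character can be matched by only one alternative, which should remove the "multiple alternatives" warnings. Spaces and punctuation are not covered by any of these ranges, so they would still need lexer rules of their own.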
Practically speaking, TOKEN : (UNICODE)+; is completely useless.
Since everything is a token character, if you try to use such a rule to match a Java program, say, it will simply match the whole program and return it to you as one big token.
You really do need to break your characters down into different groups if you want to split your input apart into meaningful fragments.
It might help you to take a look at how the "pros" have done it. Here is a BNF grammar for Java, and here is the BNF for an identifier, which shows how they went to the trouble of grouping the characters:
identifier
::= "a..z,$,_" { "a..z,$,_,0..9,unicode character over 00C0" }
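One way that identifier definition might be carried over into lexer rules is sketched below. The rule names ID, INT, WS, LETTER and DIGIT are made up for illustration, upper-case letters are added on the assumption that the BNF's "a..z" is meant to include them, the upper bound '\uFFFE' is borrowed from the charVocabulary example above, and the whitespace action assumes ANTLR 3 with the Java target:

```antlr
// Identifier: a letter-like character followed by any mix of letters and digits
ID  : LETTER (LETTER | DIGIT)* ;
// Unsigned integer, kept as its own token type instead of being lumped in with ID
INT : DIGIT+ ;
// Whitespace separates tokens; it is routed to the hidden channel (assumed ANTLR 3, Java target)
WS  : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;

// Letter-like characters, following the BNF: letters, '$', '_' and Unicode from 00C0 up
fragment LETTER : 'a'..'z' | 'A'..'Z' | '$' | '_' | '\u00C0'..'\uFFFE' ;
fragment DIGIT  : '0'..'9' ;
```

With the characters grouped like this, input such as "total = 42" would come back as separate ID and INT tokens rather than one long token, although the '=' would still need a rule of its own.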