In exploring precisely which characters were permitted in Java identifiers, I have stumbled upon something so extremely curious that it seems nearly certain to be a bug.
I’d expected to find that Java identifiers conformed to the requirement that they start with characters that have the Unicode property ID_Start and are followed by those with the property ID_Continue, with an exception granted for leading underscores and for dollar signs. That did not prove to be the case, and what I found is at extreme variance with that or any other idea of a normal identifier that I have heard of.
Consider the following demonstration proving that an ASCII ESC character (octal 033) is permitted in Java identifiers:
$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: \033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[
It’s even worse than that, though. Almost infinitely worse, in fact. Even NULLs are permitted! And thousands of other code points that are not even identifier characters. I have tested this on Solaris, Linux, and a Mac running Darwin, and all give the same results.
Here is a test program that will show all these unexpected code points that Java quite outrageously allows as part of a legal identifier name.
#!/usr/bin/env perl
#
# test-java-idchars - find which bogus code points Java allows in its identifiers
#
#   usage: test-java-idchars [low high]
#    e.g.: test-java-idchars 0 255
#
# Without arguments, tests Unicode code points
# from 0 .. 0x1000.  You may go further with a
# higher explicit argument.
#
# Produces a report at the end.
#
# You can ^C it prematurely to end the program then
# and get a report of its progress up to that point.
#
# Tom Christiansen
# [email protected]
# Sat Jan 29 10:41:09 MST 2011

use strict;
use warnings;

use encoding "Latin1";
use open IO => ":utf8";

use charnames ();

$| = 1;

my @legal;

my ($start, $stop) = (0, 0x1000);

if (@ARGV != 0) {
    if (@ARGV == 1) {
        for (($stop) = @ARGV) {
            $_ = oct if /^0/;   # support 0OCTAL, 0xHEX, 0bBINARY
        }
    }
    elsif (@ARGV == 2) {
        for (($start, $stop) = @ARGV) {
            $_ = oct if /^0/;
        }
    }
    else {
        die "usage: $0 [ [start] stop ]\n";
    }
}

for my $cp ( $start .. $stop ) {
    my $char = chr($cp);

    next if $char =~ /[\s\w]/;

    my $type = "?";
    for ($char) {
        $type = "Letter"      if /\pL/;
        $type = "Mark"        if /\pM/;
        $type = "Number"      if /\pN/;
        $type = "Punctuation" if /\pP/;
        $type = "Symbol"      if /\pS/;
        $type = "Separator"   if /\pZ/;
        $type = "Control"     if /\pC/;
    }
    my $name = $cp ? (charnames::viacode($cp) || "<missing>") : "NULL";
    next if $name eq "<missing>" && $cp > 0xFF;
    my $msg = sprintf("U+%04X %s", $cp, $name);
    print "testing \\p{$type} $msg...";
    open(TESTPROGRAM, ">:utf8", "testchar.java") || die $!;

print TESTPROGRAM <<"End_of_Java_Program";

public class testchar {
    public static void main(String argv[]) {
        String var_$char = "variable name ends in $msg";
        System.out.println(var_$char);
    }
}

End_of_Java_Program

    close(TESTPROGRAM) || die $!;

    system q{
        ( javac -encoding UTF-8 testchar.java \
            && \
          java -Dfile.encoding=UTF-8 testchar | grep variable \
        ) >/dev/null 2>&1
    };

    push @legal, sprintf("U+%04X", $cp) if $? == 0;

    if ($? && $? < 128) {
        print "<interrupted>\n";
        exit;  # from a ^C
    }

    printf "is %s in Java identifiers.\n",
        ($? == 0) ? uc "legal" : "forbidden";
}

END {
    print "Legal but evil code points: @legal\n";
}
Here is a sample of running that program on just the first 33 code points (U+0000 through U+0020), skipping those that are whitespace or identifier characters:
$ perl test-java-idchars 0 0x20
testing \p{Control} U+0000 NULL...is LEGAL in Java identifiers.
testing \p{Control} U+0001 START OF HEADING...is LEGAL in Java identifiers.
testing \p{Control} U+0002 START OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0003 END OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0004 END OF TRANSMISSION...is LEGAL in Java identifiers.
testing \p{Control} U+0005 ENQUIRY...is LEGAL in Java identifiers.
testing \p{Control} U+0006 ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0007 BELL...is LEGAL in Java identifiers.
testing \p{Control} U+0008 BACKSPACE...is LEGAL in Java identifiers.
testing \p{Control} U+000B LINE TABULATION...is forbidden in Java identifiers.
testing \p{Control} U+000E SHIFT OUT...is LEGAL in Java identifiers.
testing \p{Control} U+000F SHIFT IN...is LEGAL in Java identifiers.
testing \p{Control} U+0010 DATA LINK ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+0011 DEVICE CONTROL ONE...is LEGAL in Java identifiers.
testing \p{Control} U+0012 DEVICE CONTROL TWO...is LEGAL in Java identifiers.
testing \p{Control} U+0013 DEVICE CONTROL THREE...is LEGAL in Java identifiers.
testing \p{Control} U+0014 DEVICE CONTROL FOUR...is LEGAL in Java identifiers.
testing \p{Control} U+0015 NEGATIVE ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0016 SYNCHRONOUS IDLE...is LEGAL in Java identifiers.
testing \p{Control} U+0017 END OF TRANSMISSION BLOCK...is LEGAL in Java identifiers.
testing \p{Control} U+0018 CANCEL...is LEGAL in Java identifiers.
testing \p{Control} U+0019 END OF MEDIUM...is LEGAL in Java identifiers.
testing \p{Control} U+001A SUBSTITUTE...is LEGAL in Java identifiers.
testing \p{Control} U+001B ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+001C INFORMATION SEPARATOR FOUR...is forbidden in Java identifiers.
testing \p{Control} U+001D INFORMATION SEPARATOR THREE...is forbidden in Java identifiers.
testing \p{Control} U+001E INFORMATION SEPARATOR TWO...is forbidden in Java identifiers.
testing \p{Control} U+001F INFORMATION SEPARATOR ONE...is forbidden in Java identifiers.
Legal but evil code points: U+0000 U+0001 U+0002 U+0003 U+0004 U+0005 U+0006 U+0007 U+0008 U+000E U+000F U+0010 U+0011 U+0012 U+0013 U+0014 U+0015 U+0016 U+0017 U+0018 U+0019 U+001A U+001B
And here is another demo:
$ perl test-java-idchars 0x600 0x700 | grep -i legal
testing \p{Control} U+0600 ARABIC NUMBER SIGN...is LEGAL in Java identifiers.
testing \p{Control} U+0601 ARABIC SIGN SANAH...is LEGAL in Java identifiers.
testing \p{Control} U+0602 ARABIC FOOTNOTE MARKER...is LEGAL in Java identifiers.
testing \p{Control} U+0603 ARABIC SIGN SAFHA...is LEGAL in Java identifiers.
testing \p{Control} U+06DD ARABIC END OF AYAH...is LEGAL in Java identifiers.
Legal but evil code points: U+0600 U+0601 U+0602 U+0603 U+06DD
Can anyone please explain this seemingly insane behavior? There are many, many, many other inexplicably permitted code points all over the place, starting right off with U+0000, which is perhaps the strangest of all. If you run it on the first 0x1000 code points, you do see certain patterns appear, such as permitting any and all code points with the property Currency_Symbol. But too much else is wholly inexplicable, at least by me.
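The currency-symbol pattern mentioned above can also be checked from inside Java, without compiling a test file per code point. This is only a minimal sketch: the class name `CurrencyIdentifierCheck` and its helper method are my own invention for illustration, and it relies on `Character.getType(int)` reporting the general category Sc as `Character.CURRENCY_SYMBOL`.

```java
// Hypothetical helper (not from the original post): counts currency symbols
// that Java would NOT accept as part of an identifier.
class CurrencyIdentifierCheck {

    /** Counts code points in [low, high] with general category Sc that are rejected as identifier parts. */
    static int rejectedCurrencySymbols(int low, int high) {
        int rejected = 0;
        for (int cp = low; cp <= high; cp++) {
            boolean isCurrency = Character.getType(cp) == Character.CURRENCY_SYMBOL;
            if (isCurrency && !Character.isJavaIdentifierPart(cp)) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Same default range as the Perl test program: 0 .. 0x1000.
        System.out.println("Rejected currency symbols: "
                + rejectedCurrencySymbols(0, 0x1000));
    }
}
```

If the count comes back as zero, that agrees with what the Perl program found by brute force: every currency symbol in the range is accepted.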
Control characters do not represent printable characters; instead, they trigger an action, such as a backspace or a bell, rather than putting a glyph on the display.
Identifiers are usually described as being built from letters, digits, the underscore, and the dollar sign. You can't use spaces, tabs, or symbols such as #, @, and ! in an identifier.
A: In Java, every identifier must begin with a letter, an underscore, or a Unicode currency character. Any other first character, such as a digit, is not valid. Furthermore, an identifier cannot have the same spelling as one of Java's reserved words.
By convention, an identifier is usually 4 to 15 characters long, although the language itself places no limit on length. It is still good to follow the standard conventions. Java's reserved keywords, such as int, float, double, and char, cannot be used as identifiers.
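The rules quoted above (a legal first character, legal following characters, not a reserved word) can be sketched as one small check. Treat this as an illustration under assumptions rather than what javac does internally: the class `IdentifierRules` and the method `isLegalIdentifier` are names I made up, and the keyword test leans on `javax.lang.model.SourceVersion.isKeyword`, which also rejects the literals `true`, `false`, and `null`.

```java
import javax.lang.model.SourceVersion;

// Hypothetical helper (not from the original post or the JDK).
class IdentifierRules {

    /** Returns true if s starts with an identifier-start character, continues with identifier-part characters, and is not a keyword. */
    static boolean isLegalIdentifier(String s) {
        if (s.isEmpty() || SourceVersion.isKeyword(s)) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Walk by code point so supplementary characters are handled correctly.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String withEscape = "var_" + (char) 0x1B;   // the ESC example from the question
        for (String candidate : new String[] {"_count", "$total", "9lives", "int", "a#b", withEscape}) {
            System.out.println(isLegalIdentifier(candidate));
        }
    }
}
```

Note that the ESC-terminated name passes this check, which is exactly the surprise the question is about: the commonly repeated rules are narrower than what the `Character` methods accept.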
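The Java Language Specification, section 3.8, defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter, among other conditions, accepts any character for which Character.isIdentifierIgnorable() is true. That covers the non-whitespace control characters (including the whole C1 range; see the link for the list) as well as characters in the Format (Cf) general category, which accounts for U+0600 through U+0603 and U+06DD.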
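To see this for the C0 controls without going through javac, one can ask the `Character` class directly. A minimal sketch follows; the class name `IgnorableControls` and the method `ignorableParts` are hypothetical names chosen for this example, while the two `Character` methods are the ones named above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the original post or the JDK).
class IgnorableControls {

    /** Returns the code points in [low, high] that are identifier parts and also identifier-ignorable. */
    static List<Integer> ignorableParts(int low, int high) {
        List<Integer> result = new ArrayList<>();
        for (int cp = low; cp <= high; cp++) {
            if (Character.isJavaIdentifierPart(cp) && Character.isIdentifierIgnorable(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same range as the first sample run of the Perl program.
        for (int cp : ignorableParts(0x00, 0x1F)) {
            System.out.printf("U+%04X is ignorable, hence legal in identifiers%n", cp);
        }
    }
}
```

Going by the documented behavior of `isIdentifierIgnorable`, this should list U+0000 through U+0008 and U+000E through U+001B, the same 23 code points the Perl program reported as legal, and leave out U+001C through U+001F, which Java treats as whitespace.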