
What's an "ignorable character in a Java identifier"?

I stumbled across this doc and wondered what that was all about. Apparently you can have certain control characters inside identifiers and they are ignored:

public class Main {
    public static void main(String[] args) throws Exception {
        int dummy = 123;
        // The Unicode escape below puts U+200B (zero-width space) between `d` and `u`
        System.out.println(d\u200Bummy);
    }
}

I couldn't find anything about this in the JLS. IntelliJ IDEA shows an error in the editor saying "dummy" is an undeclared identifier, but the code nevertheless compiles and runs. I guess that's a bug in IntelliJ? What purpose do these "ignorable characters" serve?

(Note: StackOverflow seems to remove my control characters from the question)

Klitos Kyriacou asked Jun 22 '17 14:06



1 Answer

There is an open issue for this contradiction.

In summary, the compiler does ignore these characters when matching identifier names, but the JLS doesn't mention this. Instead, the JLS says:

Two identifiers are the same only if they are identical, that is, have the same Unicode character for each letter or digit.

It also says:

A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true

The contradiction is apparent because:

Character.isJavaIdentifierPart('\u0001')  -> true, so used to compare identifier names
Character.isIdentifierIgnorable('\u0001') -> true, should be ignored actually
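
The two calls above can be checked directly. Here is a minimal sketch; the class name `IgnorableCheck`, the helper `isPartAndIgnorable`, and the sample code points other than U+0001 are my own choices for illustration, not something taken from the JLS or the linked issue.

```java
class IgnorableCheck {
    // Hypothetical helper: true when a code point is accepted as an identifier
    // part AND is reported as ignorable, i.e. the contradictory case.
    static boolean isPartAndIgnorable(int codePoint) {
        return Character.isJavaIdentifierPart(codePoint)
                && Character.isIdentifierIgnorable(codePoint);
    }

    public static void main(String[] args) {
        // U+0001 (control), U+200B (zero-width space), a letter, and a space
        int[] samples = {0x0001, 0x200B, 'a', ' '};
        for (int cp : samples) {
            System.out.printf("U+%04X part=%b ignorable=%b%n", cp,
                    Character.isJavaIdentifierPart(cp),
                    Character.isIdentifierIgnorable(cp));
        }
    }
}
```

An ordinary letter such as `a` is an identifier part but not ignorable, while a space is neither, so only characters like U+0001 fall into the overlap that the JLS wording doesn't account for.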

I speculate that IntelliJ IDEA either follows the JLS or is simply unaware of ignorable characters. I don't see a bug report for this here.

As for the purpose of these ignorable characters: Unicode specifies some Layout and Format Control Characters, and suggests that they be ignored in identifier names because

the effects they represent are stylistic or otherwise out of scope for identifiers, and second because the characters themselves often have no visible display

Apparently the purpose of isIdentifierIgnorable is to identify characters of this category. For instance, the isIdentifierIgnorable documentation says that it returns true for characters with the FORMAT general category value, i.e. characters whose Unicode General_Category is Cf, which are among the Layout and Format Control Characters.
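
To make the relationship between "ignorable" and the Cf category concrete, here is a rough sketch. The class `IgnorableStripper` and both method names are hypothetical; the stripping step is only an approximation of "comparing identifiers with ignorable characters removed", not a description of what javac actually does internally.

```java
class IgnorableStripper {
    // Hypothetical helper: drops every code point that
    // Character.isIdentifierIgnorable reports as ignorable.
    static String stripIgnorable(String identifier) {
        StringBuilder sb = new StringBuilder(identifier.length());
        identifier.codePoints()
                .filter(cp -> !Character.isIdentifierIgnorable(cp))
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // True when the code point has Unicode General_Category Cf (FORMAT).
    static boolean isFormatCharacter(int codePoint) {
        return Character.getType(codePoint) == Character.FORMAT;
    }

    public static void main(String[] args) {
        // U+00AD (soft hyphen) is a Cf character; U+0001 is a control (Cc)
        // character that is ignorable without being in the FORMAT category.
        System.out.println(isFormatCharacter(0x00AD));
        System.out.println(isFormatCharacter(0x0001));
        System.out.println(stripIgnorable("d\u0001ummy").equals("dummy"));
    }
}
```

Note that FORMAT characters are only one part of what isIdentifierIgnorable accepts: it also returns true for ISO control characters that are not whitespace, which is why U+0001 is ignorable even though its category is not Cf.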

Manos Nikolaidis answered Oct 15 '22 18:10