Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are "connecting characters" in Java identifiers?

People also ask

How many characters are allowed in Java for identifiers?

Java identifiers are case-sensitive. There is no limit on the length of the identifier but it is advisable to use an optimum length of 4 – 15 letters only. Reserved Words can't be used as an identifier. For example “int while = 20;” is an invalid statement as while is a reserved word.

What are identifiers in Java programming?

An identifier is a sequence of one or more characters. The first character must be a valid first character (letter, $, _) in an identifier of the Java programming language, hereafter in this chapter called simply “Java”.

What character representation does Java use?

The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).


Here is a list of connecting characters. These are characters used to connect words.

http://www.fileformat.info/info/unicode/category/Pc/list.htm

U+005F _ LOW LINE
U+203F ‿ UNDERTIE
U+2040 ⁀ CHARACTER TIE
U+2054 ⁔ INVERTED UNDERTIE
U+FE33 ︳ PRESENTATION FORM FOR VERTICAL LOW LINE
U+FE34 ︴ PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
U+FE4D ﹍ DASHED LOW LINE
U+FE4E ﹎ CENTRELINE LOW LINE
U+FE4F ﹏ WAVY LOW LINE
U+FF3F _ FULLWIDTH LOW LINE

This compiles on Java 7.

int _, ‿, ⁀, ⁔, ︳, ︴, ﹍, ﹎, ﹏, _;

An example. In this case tp is the name of a column and the value for a given row.

Column<Double> ︴tp︴ = table.getColumn("tp", double.class);

double tp = row.getDouble(︴tp︴);

The following

for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++)
    if (Character.isJavaIdentifierStart(i) && !Character.isAlphabetic(i))
        System.out.print((char) i + " ");
}

prints

$ _ ¢ £ ¤ ¥ ؋ ৲ ৳ ৻ ૱ ௹ ฿ ៛ ‿ ⁀ ⁔ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ꠸ ﷼ ︳ ︴ ﹍ ﹎ ﹏ ﹩ $ _ ¢ £ ¥ ₩


iterate through the whole 65k chars and ask Character.isJavaIdentifierStart(c). The answer is : "undertie" decimal 8255


The definitive specification of a legal Java identifier can be found in the Java Language Specification.


Here is a List of connector Characters in Unicode. You will not find them on your keyboard.

U+005F LOW LINE _
U+203F UNDERTIE ‿
U+2040 CHARACTER TIE ⁀
U+2054 INVERTED UNDERTIE ⁔
U+FE33 PRESENTATION FORM FOR VERTICAL LOW LINE ︳
U+FE34 PRESENTATION FORM FOR VERTICAL WAVY LOW LINE ︴
U+FE4D DASHED LOW LINE ﹍
U+FE4E CENTRELINE LOW LINE ﹎
U+FE4F WAVY LOW LINE ﹏
U+FF3F FULLWIDTH LOW LINE _


A connecting character is used to connect two characters.

In Java, a connecting character is the one for which Character.getType(int codePoint)/Character.getType(char ch) returns a value equal to Character.CONNECTOR_PUNCTUATION.

Note that in Java, the character information is based on Unicode standard which identifies connecting characters by assigning them the general category Pc, which is an alias for Connector_Punctuation.

The following code snippet,

for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++) {
    if (Character.getType(i) == Character.CONNECTOR_PUNCTUATION
            && Character.isJavaIdentifierStart(i)) {
        System.out.println("character: " + String.valueOf(Character.toChars(i))
                + ", codepoint: " + i + ", hexcode: " + Integer.toHexString(i));
    }
}

prints the connecting characters that can be used to start an identifer on jdk1.6.0_45

character: _, codepoint: 95, hexcode: 5f
character: ‿, codepoint: 8255, hexcode: 203f
character: ⁀, codepoint: 8256, hexcode: 2040
character: ⁔, codepoint: 8276, hexcode: 2054
character: ・, codepoint: 12539, hexcode: 30fb
character: ︳, codepoint: 65075, hexcode: fe33
character: ︴, codepoint: 65076, hexcode: fe34
character: ﹍, codepoint: 65101, hexcode: fe4d
character: ﹎, codepoint: 65102, hexcode: fe4e
character: ﹏, codepoint: 65103, hexcode: fe4f
character: _, codepoint: 65343, hexcode: ff3f
character: ・, codepoint: 65381, hexcode: ff65

The following compiles on jdk1.6.0_45,

int _, ‿, ⁀, ⁔, ・, ︳, ︴, ﹍, ﹎, ﹏, _, ・ = 0;

Apparently, the above declaration fails to compile on jdk1.7.0_80 & jdk1.8.0_51 for the following two connecting characters (backward compatibility...oops!!!),

character: ・, codepoint: 12539, hexcode: 30fb
character: ・, codepoint: 65381, hexcode: ff65

Anyway, details aside, the exam focuses only on the Basic Latin character set.

Also, for legal identifers in Java, the spec is provided here. Use the Character class APIs to get more details.


One of the most, well, fun characters that is allowed in Java identifiers (however not at the start) is the unicode character named "Zero Width Non Joiner" (&zwnj;, U+200C, https://en.wikipedia.org/wiki/Zero-width_non-joiner).

I had this once in a piece of XML inside an attribute value holding a reference to another piece of that XML. Since the ZWNJ is "zero width" it cannot be seen (except when walking along with the cursor, it is displayed right on the character before). It also couldn't be seen in the logfile and/or console output. But it was there all the time: copy & paste into search fields got it and thus did not find the referred position. Typing the (visible part of the) string into the search field however found the referred position. Took me a while to figure this out.

Typing a Zero-Width-Non-Joiner is actually quite easy (too easy) when using the European keyboard layout, at least in its German variant, e.g. "Europatastatur 2.02" - it is reachable with AltGr + ".", two keys which unfortunately are located directly next to each other on most keyboards and can easily be hit together accidentally.

Back to Java: I thought well, you could write some code like this:

void foo() {
    int i = 1;
    int i‌ = 2;
}

with the second i appended by a zero-width-non-joiner (can't do that in the above code snipped in stackoverflow's editor), but that didn't work. IntelliJ (16.3.3) did not complain, but JavaC (Java 8) did complain about an already defined identifier - it seems JavaC actually allows the ZWNJ character as part of an identifier, but when using reflection to see what it does, the ZWNJ character is stripped off the identifier - something that characters like ‿ aren't.