
Unicode escape behavior in Java programs

Tags:

java

A few days ago, I was asked about this program's output:

public static void main(String[] args) {
    // \u0022 is the Unicode escape for double quote (")
    System.out.println("a\u0022.length() + \u0022b".length());
}

My first thought was that this program should print the length of a\u0022.length() + \u0022b, which is 16, but surprisingly it printed 2. I know \u0022 is the Unicode escape for ", but I thought this " would be escaped and represent just one literal " character, with no special meaning. In reality, Java parsed the statement as follows:

System.out.println("a".length() + "b".length());

I can't wrap my head around this weird behavior. Why don't Unicode escapes behave like normal escape sequences?

Update: Apparently, this was one of the brain teasers from the book Java Puzzlers: Traps, Pitfalls, and Corner Cases by Joshua Bloch and Neal Gafter. More specifically, the question comes from Puzzle 14: Escape Rout.

Ali Dehghani asked Mar 09 '16 19:03


1 Answer

Why don't Unicode escapes behave like normal escape sequences?

Basically, they're processed at a different point in reading the input: they're translated as the very first step of lexical translation, before the source is even split into tokens, let alone parsed. They're not escape sequences in character literals or string literals, they're escape sequences for the whole source file. Any character that's not part of a Unicode escape sequence can be replaced with the corresponding Unicode escape sequence. So you can write programs entirely in ASCII which actually have variable, method and class names that are non-ASCII...
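To make that concrete, here is a minimal sketch; the class and method names are made up for illustration. It shows the puzzle expression next to an escape used inside an identifier, which only works because the escapes are translated before the compiler looks for tokens. The comments deliberately write U+0022 rather than the escape itself, since escapes are translated inside comments too.

```java
public class UnicodeEscapeDemo {
    // Each escape for U+0022 is turned into a double quote before tokenization,
    // so the compiler sees: "a".length() + "b".length()
    static int puzzle() {
        return "a\u0022.length() + \u0022b".length();
    }

    // The escape for U+0061 is the letter a, so this declares a variable
    // named answer, which the return statement then reads by its plain name.
    static int identifier() {
        int \u0061nswer = 42;
        return answer;
    }

    public static void main(String[] args) {
        System.out.println(puzzle());      // prints 2
        System.out.println(identifier());  // prints 42
    }
}
```

If you actually want a double quote character inside the literal, the ordinary string escape \" is the one to use, since that is handled later, when the literal itself is read.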

Fundamentally I believe this was a design mistake in Java, as it can cause some very weird effects (e.g. if you have the escape sequence for a line break within a // comment...) but it is what it is...
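As a small illustration of the comment case (again with hypothetical names), the escape for a line feed, \u000a, ends a // comment early, so whatever follows it on the same source line is compiled as code:

```java
public class CommentEscapeDemo {
    static int count() {
        int n = 0;
        // The escape for U+000A (line feed) on the next line terminates the
        // comment, so the text after it is compiled as a real statement.
        // this looks like a harmless comment \u000a n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count());  // prints 1, not 0
    }
}
```

The same mechanism is why a malformed escape inside a comment (a backslash and u not followed by four hex digits) is a compile-time error even though it is "only" in a comment.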

This is detailed in section 3.3 of the JLS:

A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.

...

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.

This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
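The round trip described in that quote can be sketched roughly as follows. This is my own simplified illustration, not code from the JLS or the JDK: the class and method names are invented, it works on UTF-16 code units of a String, and it ignores edge cases such as a non-ASCII character directly preceded by an odd number of backslashes.

```java
public class AsciiTransform {
    static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // A backslash can start a Unicode escape only if it is preceded by an
    // even number of contiguous backslashes.
    static boolean eligible(String s, int i) {
        int n = 0;
        for (int k = i - 1; k >= 0 && s.charAt(k) == '\\'; k--) n++;
        return n % 2 == 0;
    }

    // If a Unicode escape starts at index i, returns the index of its first
    // hex digit (just past the run of u's); otherwise returns -1.
    static int hexStart(String s, int i) {
        if (s.charAt(i) != '\\' || !eligible(s, i)) return -1;
        int j = i + 1;
        while (j < s.length() && s.charAt(j) == 'u') j++;
        if (j == i + 1 || j + 4 > s.length()) return -1;
        for (int k = j; k < j + 4; k++) {
            if (!isHexDigit(s.charAt(k))) return -1;
        }
        return j;
    }

    // Unicode source to ASCII: existing escapes gain one extra u,
    // non-ASCII characters become single-u escapes.
    static String toAscii(String src) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < src.length(); ) {
            int j = hexStart(src, i);
            if (j >= 0) {
                out.append("\\u").append(src, i + 1, j + 4);
                i = j + 4;
            } else {
                char c = src.charAt(i++);
                if (c < 128) out.append(c);
                else out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    // ASCII form back to Unicode source: escapes with several u's lose one,
    // escapes with a single u become the character they denote.
    static String fromAscii(String ascii) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < ascii.length(); ) {
            int j = hexStart(ascii, i);
            if (j < 0) {
                out.append(ascii.charAt(i++));
            } else if (j - i - 1 > 1) {
                out.append('\\').append(ascii, i + 2, j + 4);
                i = j + 4;
            } else {
                out.append((char) Integer.parseInt(ascii.substring(j, j + 4), 16));
                i = j + 4;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input: a non-ASCII character (U+00E9) plus an existing single-u escape.
        String src = "caf" + (char) 0xE9 + " \\u0041";
        String ascii = toAscii(src);
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals(src));  // prints true
    }
}
```

For that sample input, the ASCII form contains a single-u escape where the non-ASCII character was and a double-u escape where the original escape was, which is exactly what lets the restore step tell the two apart.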

Jon Skeet answered Oct 30 '22 17:10