
Why does this code, written backwards, print "Hello World!"

There are invisible characters here that alter how the code is displayed. In IntelliJ these can be found by copy-pasting the code into an empty string literal (""), which replaces them with Unicode escapes, removing their effects and revealing the order the compiler sees.

Here is the output of that copy-paste:

"class M\u202E{public static void main(String[]a\u202D){System.out.print(new char[]\n"+
        "{'H','e','l','l','o',' ','W','o','r','l','d','!'});}}   "

The source code characters are stored in this order, and the compiler treats them as being in this order, but they're displayed differently.

Note the \u202E character, which is a right-to-left override, starting a block where all characters are forced to be displayed right-to-left, and the \u202D, which is a left-to-right override, starting a nested block where all characters are forced into left-to-right order, overriding the first override.

Ergo, when it displays the original code, class M is displayed normally, but the \u202E reverses the display order of everything from there to the \u202D, which reverses everything again. (Formally, everything from the \u202D to the line terminator gets reversed twice, once due to the \u202D and once with the rest of the text reversed due to the \u202E, which is why this text shows up in the middle of the line instead of the end.) The next line's directionality is handled independently of the first's due to the line terminator, so {'H','e','l','l','o',' ','W','o','r','l','d','!'});}} is displayed normally.
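
The same check can be done without an IDE. The following is a minimal sketch, not IntelliJ's actual logic: it replaces the explicit bidi formatting characters in a string with visible \uXXXX escapes. The names BidiEscaper, isBidiControl and escape are hypothetical, and the set of code points treated as bidi controls (U+202A to U+202E and U+2066 to U+2069) is an assumption made for illustration.

```java
public class BidiEscaper {
    // Assumed set of explicit bidi formatting characters:
    // U+202A..U+202E (LRE, RLE, PDF, LRO, RLO) and U+2066..U+2069 (LRI, RLI, FSI, PDI).
    public static boolean isBidiControl(int cp) {
        return (cp >= 0x202A && cp <= 0x202E) || (cp >= 0x2066 && cp <= 0x2069);
    }

    // Walks the string in logical (stored) order and makes bidi controls visible.
    public static String escape(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (isBidiControl(cp)) {
                out.append(String.format("\\u%04X", cp));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // First line of the puzzle, in the order the compiler sees it.
        String source = "class M\u202E{public static void main(String[]a\u202D){";
        System.out.println(escape(source));
    }
}
```

Because the string is processed in stored order, the output shows the overrides exactly where they sit in the file, regardless of how a bidi-aware terminal or editor would draw the original text.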

For the full (extremely complex, dozens of pages long) Unicode bidirectional algorithm, see Unicode Standard Annex #9.


It looks different because of the Unicode Bidirectional Algorithm. There are two invisible characters, RLO and LRO, that the Unicode Bidirectional Algorithm uses to change the visual appearance of the characters nested between these two metacharacters.

The result is that they appear in reverse order visually, but the actual characters in memory are not reversed. You can analyse the results here. The Java compiler does not reject RLO and LRO: they are format characters that are permitted inside identifiers (Character.isJavaIdentifierPart returns true for them) rather than whitespace, which is why the code compiles.

Note 1: This algorithm is used by text editors and browsers to display LTR characters (e.g. English) and RTL characters (e.g. Arabic, Hebrew) together at the same time - hence "bi"-directional. You can read more about the Bidirectional Algorithm at Unicode's website.
Note 2: The exact behaviour of LRO and RLO is defined in Section 2.2 of the Algorithm.


The character U+202E mirrors the code from right to left; it is very clever, though. It is hidden right after the M:

"class M\u202E{..."

How did I find the magic behind this?

Well, at first when I saw the question I thought, "it's a kind of joke, to waste somebody else's time", but then I opened my IDE (IntelliJ), created a class, and pasted the code... and it compiled!!! So I took a better look and saw that the "public static void" was backward, so I went there with the cursor and erased a few chars... And what happened? The chars started erasing backward, so I thought, hmm... strange... I have to execute it... So I proceeded to execute the program, but first I needed to save it... and that was when I found it! I couldn't save the file because my IDE said that there was a different encoding for some char, and pointed me to where it was. So I started searching Google for special chars that could do the job, and that's it :)

A little about the Unicode Bidirectional Algorithm and the U+202E character involved, briefly explained:

The Unicode Standard prescribes a memory representation order known as logical order. When text is presented in horizontal lines, most scripts display characters from left to right. However, there are several scripts (such as Arabic or Hebrew) where the natural ordering of horizontal text in display is from right to left. If all of the text has a uniform horizontal direction, then the ordering of the display text is unambiguous.

However, because these right-to-left scripts use digits that are written from left to right, the text is actually bi-directional: a mixture of right-to-left and left-to-right text. In addition to digits, embedded words from English and other scripts are also written from left to right, also producing bidirectional text. Without a clear specification, ambiguities can arise in determining the ordering of the displayed characters when the horizontal direction of the text is not uniform.

This annex describes the algorithm used to determine the directionality for bidirectional Unicode text. The algorithm extends the implicit model currently employed by a number of existing implementations and adds explicit formatting characters for special circumstances. In most cases, there is no need to include additional information with the text to obtain correct display ordering.

However, in the case of bidirectional text, there are circumstances where an implicit bidirectional ordering is not sufficient to produce comprehensible text. To deal with these cases, a minimal set of directional formatting characters is defined to control the ordering of characters when rendered. This allows exact control of the display ordering for legible interchange and ensures that plain text used for simple items like filenames or labels can always be correctly ordered for display.

Why create an algorithm like this?

The bidi algorithm can render a sequence of Arabic or Hebrew characters one after the other from right to left.


Chapter 3 of the language specification provides an explanation by describing in detail how the lexical translation is done for a Java program. What matters most for the question:

Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters.

So a program is written in Unicode characters, and the author can escape them using \uxxxx in case the file encoding does not support the Unicode character, in which case it is translated to the appropriate character. One of the Unicode characters present in this case is \u202E. It is not visually shown in the snippet, but if you try switching the encoding of the browser, the hidden characters may appear.
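
As a small illustration of that translation step, the sketch below uses Unicode escapes both inside a string literal and inside an identifier. The class name UnicodeEscapeDemo and its methods are hypothetical, chosen only for this example.

```java
public class UnicodeEscapeDemo {
    // Escapes are translated before tokenization, so this literal is "Hello".
    public static String greeting() {
        return "\u0048ello";
    }

    // An escape can also appear in an identifier: the variable below is named "hi".
    public static int answer() {
        int \u0068i = 42;
        return hi;
    }

    public static void main(String[] args) {
        System.out.println(greeting()); // prints "Hello"
        System.out.println(answer());   // prints "42"
    }
}
```

The escaped form and the raw character are interchangeable to the compiler, which is why pasting the puzzle with \u202E written out as an escape compiles the same way as the version with the invisible character.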

Therefore, the lexical translation results in the class declaration:

class M\u202E{

which means that the class identifier is M\u202E. The specification considers this a valid identifier:

Identifier:
    IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
    JavaLetter {JavaLetterOrDigit}

A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.
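
This can be checked directly against the standard java.lang.Character methods. The following is a minimal sketch; the class name IdentifierCheck and the helper canAppearInIdentifier are hypothetical.

```java
public class IdentifierCheck {
    // Thin wrapper around the check the specification refers to.
    public static boolean canAppearInIdentifier(int codePoint) {
        return Character.isJavaIdentifierPart(codePoint);
    }

    public static void main(String[] args) {
        int rlo = 0x202E; // RIGHT-TO-LEFT OVERRIDE
        int lro = 0x202D; // LEFT-TO-RIGHT OVERRIDE

        System.out.println(canAppearInIdentifier(rlo));              // true
        System.out.println(canAppearInIdentifier(lro));              // true
        System.out.println(Character.isJavaIdentifierStart(rlo));    // false
        System.out.println(Character.isIdentifierIgnorable(rlo));    // true
        System.out.println(Character.isWhitespace(rlo));             // false
        System.out.println(Character.getType(rlo) == Character.FORMAT); // true
    }
}
```

Both overrides belong to the FORMAT general category, which Character.isIdentifierIgnorable accepts, so they may follow the letter M inside an identifier but could not begin one.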