Is ED A0 80 ED B0 80 a valid UTF-8 byte sequence?

java.nio.charset.Charset.forName("utf8").decode decodes a byte sequence of

 ED A0 80 ED B0 80

into the Unicode codepoint:

 U+10000

java.nio.charset.Charset.forName("utf8").decode also decodes a byte sequence of

 F0 90 80 80

into the Unicode codepoint:

 U+10000

This is verified by the code below.

This seems to tell me that the UTF-8 encoding scheme decodes both ED A0 80 ED B0 80 and F0 90 80 80 into the same Unicode code point.

However, if I visit https://www.google.com/search?query=%ED%A0%80%ED%B0%80, I can see that the result is clearly different from the page https://www.google.com/search?query=%F0%90%80%80

Since the Google Search is using UTF-8 encoding scheme (correct me if I'm wrong) as well, this suggests that UTF-8 does not decode ED A0 80 ED B0 80 and F0 90 80 80 into the same Unicode code point(s).

So, according to the official standard, should UTF-8 decode the byte sequence ED A0 80 ED B0 80 into the Unicode code point U+10000?

Code:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class Test {

    public static void main(String[] args) {
        // Surrogate pair D800 DC00, each half written as its own 3-byte sequence.
        ByteBuffer bb = ByteBuffer.wrap(new byte[] { (byte) 0xED, (byte) 0xA0, (byte) 0x80, (byte) 0xED, (byte) 0xB0, (byte) 0x80 });
        CharBuffer cb = Charset.forName("utf8").decode(bb);
        // Print every decoded UTF-16 code unit in hex.
        for (int x = 0, xx = cb.limit(); x < xx; ++x) {
            System.out.println(Integer.toHexString(cb.get(x)));
        }
        System.out.println();
        // U+10000 written as a single 4-byte UTF-8 sequence.
        bb = ByteBuffer.wrap(new byte[] { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 });
        cb = Charset.forName("utf8").decode(bb);
        for (int x = 0, xx = cb.limit(); x < xx; ++x) {
            System.out.println(Integer.toHexString(cb.get(x)));
        }
    }
}
asked Jan 12 '12 23:01 by Pacerier



2 Answers

ED A0 80 ED B0 80 is the UTF-8 encoding of the UTF-16 surrogate pair D800 DC00. This is NOT allowed in UTF-8:

However, pairs of UCS-2 values between D800 and DFFF (surrogate pairs in Unicode parlance)...need special treatment: the UTF-16 transformation must be undone, yielding a UCS-4 character that is then transformed as above.

However, such an encoding is used in CESU-8 and Java's "Modified UTF-8".
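
To make the relationship between the two byte sequences concrete, here is a minimal sketch of the bit arithmetic. The class and method names (SurrogateDemo, decodeThreeBytes, decodeFourBytes) are made up for illustration, and the helpers deliberately perform no validation, so this shows only the arithmetic and is not a conforming decoder.

```java
final class SurrogateDemo {

    /** Extracts the payload bits of a 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx. */
    public static int decodeThreeBytes(int b1, int b2, int b3) {
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    }

    /** Extracts the payload bits of a 4-byte sequence: 11110xxx followed by three 10xxxxxx bytes. */
    public static int decodeFourBytes(int b1, int b2, int b3, int b4) {
        return ((b1 & 0x07) << 18) | ((b2 & 0x3F) << 12) | ((b3 & 0x3F) << 6) | (b4 & 0x3F);
    }

    public static void main(String[] args) {
        // Each 3-byte group yields one UTF-16 surrogate, not a Unicode scalar value.
        int high = decodeThreeBytes(0xED, 0xA0, 0x80);
        int low = decodeThreeBytes(0xED, 0xB0, 0x80);
        System.out.printf("%04X %04X%n", high, low); // D800 DC00

        // Undoing the UTF-16 transformation combines the pair into one code point.
        int combined = Character.toCodePoint((char) high, (char) low);
        System.out.printf("U+%X%n", combined); // U+10000

        // The 4-byte form encodes that code point directly.
        System.out.printf("U+%X%n", decodeFourBytes(0xF0, 0x90, 0x80, 0x80)); // U+10000
    }
}
```

Both routes end at U+10000, but only the 4-byte form is UTF-8; the 6-byte form is the same code point pushed through UTF-16 first.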

Since the Google Search is using UTF-8 encoding scheme (correct me if I'm wrong) as well,

It appears, based on the search box, that Google is using some kind of encoding auto-detection. If you pass it F0 90 80 80, which is valid UTF-8, it interprets it as UTF-8 (𐀀). If you pass it ED A0 80 ED B0 80, which is invalid UTF-8, it interprets it as windows-1252 (í €í°€).
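
A "try strict UTF-8, otherwise fall back" strategy like the one guessed at above can be sketched in Java as follows. This is only an illustration of the idea, not a description of what Google actually does. StrictUtf8Check, isValidUtf8 and decodeWithFallback are hypothetical names, and the sketch assumes a current JDK, where the UTF-8 decoder reports encoded surrogates as malformed input and the windows-1252 charset is available.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Check {

    /** Returns true if the bytes decode as UTF-8 with no malformed-input error. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Hypothetical fallback: strict UTF-8 first, otherwise windows-1252. */
    public static String decodeWithFallback(byte[] bytes) {
        if (isValidUtf8(bytes)) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
        // Assumption: the running JDK provides the windows-1252 charset.
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        byte[] surrogates = { (byte) 0xED, (byte) 0xA0, (byte) 0x80, (byte) 0xED, (byte) 0xB0, (byte) 0x80 };
        byte[] fourBytes = { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 };
        System.out.println(isValidUtf8(surrogates));
        System.out.println(isValidUtf8(fourBytes));
        System.out.println(decodeWithFallback(surrogates));
    }
}
```

With the fallback, the six bytes come out as six separate windows-1252 characters, which matches the mojibake seen in the search box.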

answered Oct 14 '22 21:10 by dan04


Java's UTF-8 is really a CESU-8 variant. The first case uses surrogate pairs encoded in UTF-8 "style".
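
One place where the surrogate-style encoding can still be observed is Java's modified UTF-8, which DataOutputStream.writeUTF produces. The sketch below compares it with the standard UTF-8 charset for U+10000. ModifiedUtf8Demo and its helper names are made up for illustration, and the output of the standard charset assumes a current JDK, which may differ from the behaviour the question observed in 2012.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Demo {

    /** Encodes with DataOutputStream.writeUTF and drops its 2-byte length prefix. */
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buffer.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10000));
        System.out.println(hex(modifiedUtf8(s)));                    // ED A0 80 ED B0 80
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // F0 90 80 80
    }
}
```

The modified form encodes each UTF-16 code unit separately, which is why the surrogate pair shows up as two 3-byte groups.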

answered Oct 14 '22 20:10 by Logan Capaldo