java.nio.charset.Charset.forName("utf8").decode decodes the byte sequence
ED A0 80 ED B0 80
into the Unicode codepoint U+10000. It also decodes the byte sequence
F0 90 80 80
into the same codepoint, U+10000. This is verified by the code below.
Now this seems to be telling me that the UTF-8 encoding scheme decodes both ED A0 80 ED B0 80 and F0 90 80 80 into the same Unicode codepoint.
However, if I visit https://www.google.com/search?query=%ED%A0%80%ED%B0%80, the result is clearly different from the page https://www.google.com/search?query=%F0%90%80%80.
Since Google Search also uses the UTF-8 encoding scheme (correct me if I'm wrong), this suggests that UTF-8 does not decode ED A0 80 ED B0 80 and F0 90 80 80 into the same Unicode codepoint(s).
So basically I was wondering: by the official standard, should UTF-8 decode the byte sequence ED A0 80 ED B0 80 into the Unicode codepoint U+10000?
Code:
public class Test {
    public static void main(String[] args) {
        // The surrogate pair D800 DC00 encoded byte-by-byte (CESU-8 style).
        java.nio.ByteBuffer bb = java.nio.ByteBuffer.wrap(new byte[] {
                (byte) 0xED, (byte) 0xA0, (byte) 0x80,
                (byte) 0xED, (byte) 0xB0, (byte) 0x80 });
        java.nio.CharBuffer cb = java.nio.charset.Charset.forName("utf8").decode(bb);
        for (int x = 0, xx = cb.limit(); x < xx; ++x) {
            System.out.println(Integer.toHexString(cb.get(x)));
        }
        System.out.println();

        // The standard four-byte UTF-8 encoding of U+10000.
        bb = java.nio.ByteBuffer.wrap(new byte[] {
                (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 });
        cb = java.nio.charset.Charset.forName("utf8").decode(bb);
        for (int x = 0, xx = cb.limit(); x < xx; ++x) {
            System.out.println(Integer.toHexString(cb.get(x)));
        }
    }
}
ED A0 80 ED B0 80 is the UTF-8-style encoding of the UTF-16 surrogate pair D800 DC00. This is NOT allowed in UTF-8:
However, pairs of UCS-2 values between D800 and DFFF (surrogate pairs in Unicode parlance)...need special treatment: the UTF-16 transformation must be undone, yielding a UCS-4 character that is then transformed as above.
However, such an encoding is used in CESU-8 and Java's "Modified UTF-8".
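To make this concrete, here is a small sketch (class name is mine) showing both halves of the rule: Java's standard UTF-8 encoder undoes the surrogate pair and emits F0 90 80 80 for U+10000, and on JDK 7 and later a strict decoder rejects the encoded-surrogate bytes as malformed:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8SurrogateCheck {
    public static void main(String[] args) {
        // U+10000 is stored in a Java String as the surrogate pair D800 DC00.
        String s = new String(Character.toChars(0x10000));

        // A conforming UTF-8 encoder undoes the surrogate pair and emits
        // the four-byte form.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) hex.append(String.format("%02X ", b));
        System.out.println(hex.toString().trim()); // F0 90 80 80

        // The CESU-8-style bytes from the question.
        byte[] cesu = { (byte) 0xED, (byte) 0xA0, (byte) 0x80,
                        (byte) 0xED, (byte) 0xB0, (byte) 0x80 };
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(cesu));
            System.out.println("decoded (older, lenient decoder behavior)");
        } catch (CharacterCodingException e) {
            // On JDK 7+ the UTF-8 decoder treats encoded surrogates as malformed.
            System.out.println("rejected as malformed");
        }
    }
}
```

The question's output (U+10000 from both sequences) matches the lenient decoder of older JDKs; current JDKs take the `catch` branch instead.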
Since the Google Search is using UTF-8 encoding scheme (correct me if I'm wrong) as well,
It appears, based on the search box, that Google is using some kind of encoding auto-detection. If you pass it F0 90 80 80, which is valid UTF-8, it interprets it as UTF-8 (𐀀). If you pass it ED A0 80 ED B0 80, which is invalid UTF-8, it interprets it as windows-1252 (í €í°€).
Java's UTF8 is really a CESU-8 variant. The first case is using surrogate pairs encoded in UTF8 "style".
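This Modified UTF-8 behavior is still visible in DataOutputStream.writeUTF, which by specification encodes each char of a surrogate pair as a separate 3-byte sequence, producing exactly the ED A0 80 ED B0 80 pattern from the question (class name below is mine):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        // writeUTF encodes each char separately, including surrogates,
        // so U+10000 (chars D800 DC00) becomes two 3-byte sequences.
        new DataOutputStream(bos).writeUTF(new String(Character.toChars(0x10000)));
        byte[] out = bos.toByteArray();

        // The first two bytes are a big-endian byte-length prefix (00 06),
        // followed by the Modified UTF-8 payload.
        StringBuilder hex = new StringBuilder();
        for (byte b : out) hex.append(String.format("%02X ", b));
        System.out.println(hex.toString().trim()); // 00 06 ED A0 80 ED B0 80
    }
}
```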