
Why is Java String.length inconsistent across platforms with unicode characters?

Tags:

java

string

encoding

According to the Java documentation for String.length:

public int length()

Returns the length of this string.

The length is equal to the number of Unicode code units in the string.

Specified by:

length in interface CharSequence

Returns:

the length of the sequence of characters represented by this object.

But then I don't understand why the following program, HelloUnicode.java, produces different results on different platforms. According to my understanding, the number of Unicode code units should be the same, since Java supposedly always represents strings in UTF-16:

public class HelloUnicode {

    public static void main(String[] args) {
        String myString = "I have a 🙂 in my string";
        System.out.println("String: " + myString);
        System.out.println("Bytes: " + bytesToHex(myString.getBytes()));
        System.out.println("String Length: " + myString.length());
        System.out.println("Byte Length: " + myString.getBytes().length);
        System.out.println("Substring 9 - 13: " + myString.substring(9, 13));
        System.out.println("Substring Bytes: " + bytesToHex(myString.substring(9, 13).getBytes()));
    }

    // Code from https://stackoverflow.com/a/9855338/4019986
    private final static char[] hexArray = "0123456789ABCDEF".toCharArray();
    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for ( int j = 0; j < bytes.length; j++ ) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = hexArray[v >>> 4];
            hexChars[j * 2 + 1] = hexArray[v & 0x0F];
        }
        return new String(hexChars);
    }

}

The output of this program on my Windows box is:

String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 26
Byte Length: 26
Substring 9 - 13: 🙂
Substring Bytes: F09F9982

The output on my CentOS 7 machine is:

String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069

I ran both with Java 1.8. Same byte length, different String length. Why?

UPDATE

By replacing the "🙂" in the string with "\uD83D\uDE42", I get the following results:

Windows:

String: I have a ? in my string
Bytes: 4920686176652061203F20696E206D7920737472696E67
String Length: 24
Byte Length: 23
Substring 9 - 13: ? i
Substring Bytes: 3F2069

CentOS:

String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069

Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...

Java Versions:

Windows:

java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)

CentOS:

openjdk version "1.8.0_201"
OpenJDK Runtime Environment (build 1.8.0_201-b09)
OpenJDK 64-Bit Server VM (build 25.201-b09, mixed mode)

Update 2

Using .getBytes("utf-8"), with the "🙂" embedded in the string literal, here are the outputs.

Windows:

String: I have a 🙂 in my string
Bytes: 492068617665206120C3B0C5B8E284A2E2809A20696E206D7920737472696E67
String Length: 26
Byte Length: 32
Substring 9 - 13: 🙂
Substring Bytes: C3B0C5B8E284A2E2809A

CentOS:

String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069

So yes it appears to be a difference in system encoding. But then that means string literals are encoded differently on different platforms? That sounds like it could be problematic in certain situations.

Also... where is the byte sequence C3B0C5B8E284A2E2809A coming from to represent the smiley in Windows? That doesn't make sense to me.

For completeness, using .getBytes("utf-16"), with the "🙂" embedded in the string literal, here are the outputs.

Windows:

String: I have a 🙂 in my string
Bytes: FEFF00490020006800610076006500200061002000F001782122201A00200069006E0020006D007900200073007400720069006E0067
String Length: 26
Byte Length: 54
Substring 9 - 13: 🙂
Substring Bytes: FEFF00F001782122201A

CentOS:

String: I have a 🙂 in my string
Bytes: FEFF004900200068006100760065002000610020D83DDE4200200069006E0020006D007900200073007400720069006E0067
String Length: 24
Byte Length: 50
Substring 9 - 13: 🙂 i
Substring Bytes: FEFFD83DDE4200200069

870

asked May 21 '19 04:05

NanoWizard

2 Answers

You have to be careful about which encoding is used at each step:

When you compile the Java file, the compiler decodes the source file using some encoding. My guess is that this already broke your original String literal at compile time. Using the escape sequence fixes this, because the escape itself is plain ASCII.
Once you use the escape sequence, String.length() returns the same value on both platforms. The contents of the String are also the same, but what you print out does not show that.
The printed bytes differ because you called getBytes() without arguments, which again uses the platform-specific default encoding. On Windows that step was lossy as well (the unencodable smiley was replaced with a question mark). Call getBytes("UTF-8") to get platform-independent output.
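As a rough sketch of the platform-independent variant described above (the class name PortableSmiley and the toHex helper are made up for this illustration), the literal uses the escape sequence and the bytes are requested explicitly as UTF-8 via StandardCharsets:

```java
import java.nio.charset.StandardCharsets;

class PortableSmiley {
    // U+1F642 written as a surrogate-pair escape, so the source file stays pure ASCII
    // and the compiler's source encoding cannot change the literal.
    static final String TEXT = "I have a \uD83D\uDE42 in my string";

    // Hypothetical helper: upper-case hex dump of a byte array.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Explicit charset: no dependence on the platform default.
        byte[] utf8 = TEXT.getBytes(StandardCharsets.UTF_8);
        System.out.println("String Length: " + TEXT.length());
        System.out.println("Byte Length: " + utf8.length);
        System.out.println("Bytes: " + toHex(utf8));
    }
}
```

With this version the string length should be 24 and the byte length 26 regardless of platform, matching the CentOS output in the question.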

So to answer the specific questions posed:

Same byte length, different String length. Why?

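Because the Java compiler has to decode the string literal from the source file, and by default it uses the platform's encoding, which often differs between systems. The same source bytes can therefore decode to a different number of UTF-16 code units (4 on your Windows machine, 2 on CentOS), which results in a different string length. The byte lengths match only because getBytes() on Windows re-encodes those 4 characters with the same default encoding, giving back the original 4 bytes. Passing the same -encoding option to javac on every platform (for example -encoding UTF-8) makes the compiler decode the source consistently.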
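The effect can be imitated at runtime without involving the compiler. The following is only a sketch: it decodes the four UTF-8 bytes of the smiley with two different charsets, the way javac would with two different source encodings. The class and method names are invented here, and it assumes the JRE ships the windows-1252 charset (common, but not guaranteed by the specification).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SourceDecodingDemo {
    // The four bytes a UTF-8 editor writes to disk for U+1F642.
    static final byte[] SMILEY_UTF8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x99, (byte) 0x82};

    // Number of UTF-16 code units obtained when the bytes are read with the given encoding.
    static int decodedLength(Charset sourceEncoding) {
        return new String(SMILEY_UTF8, sourceEncoding).length();
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available in this JRE.
        Charset cp1252 = Charset.forName("windows-1252");
        // One supplementary code point, stored as a surrogate pair: 2 code units.
        System.out.println("read as UTF-8:        " + decodedLength(StandardCharsets.UTF_8));
        // Each byte becomes its own character: 4 code units.
        System.out.println("read as windows-1252: " + decodedLength(cp1252));
    }
}
```

The difference of two code units is exactly the difference between the lengths 24 and 26 reported in the question.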

Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...

It's not encoded as 0x3F in the string. 0x3F is the question mark. Java substitutes it when it is asked to encode a character that the target charset cannot represent, which applies to both System.out.println and getBytes(). That is what happened here: the string correctly holds the UTF-16 surrogate pair, but the Windows default encoding has no representation for U+1F642, so both printing the string to the console and calling getBytes() on it produced a question mark.
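A small sketch of that substitution, again assuming windows-1252 is available and using an invented class name: the string itself keeps both surrogates, while the encoder reports that it cannot represent them and getBytes falls back to the replacement byte.

```java
import java.nio.charset.Charset;

class ReplacementDemo {
    static final String SMILEY = "\uD83D\uDE42";

    public static void main(String[] args) {
        // Assumption: windows-1252 is available in this JRE.
        Charset cp1252 = Charset.forName("windows-1252");

        // The string still holds the full surrogate pair.
        System.out.println("code units in string: " + SMILEY.length());

        // The charset has no mapping for U+1F642.
        System.out.println("encodable in cp1252:  " + cp1252.newEncoder().canEncode(SMILEY));

        // getBytes never fails here; it silently substitutes the replacement byte.
        for (byte b : SMILEY.getBytes(cp1252)) {
            System.out.printf("encoded byte: %02X%n", b & 0xFF);
        }
    }
}
```

If silent replacement is not acceptable, a CharsetEncoder configured with CodingErrorAction.REPORT will throw instead of substituting.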

But then that means string literals are encoded differently on different platforms?

By default, yes.

Also... where is the byte sequence C3B0C5B8E284A2E2809A coming from to represent the smiley in Windows?

This is quite convoluted. The "🙂" character (Unicode code point U+1F642) is stored in the Java source file with UTF-8 encoding as the byte sequence F0 9F 99 82. The Java compiler then reads the source file using the platform default encoding, Cp1252 (Windows-1252), so it treats these UTF-8 bytes as though they were Cp1252 characters. Translating each byte from Cp1252 to Unicode produces a 4-character string: U+00F0 U+0178 U+2122 U+201A. The getBytes("utf-8") call then converts this 4-character string into bytes by encoding it as UTF-8. Since every character of the string is above hex 7F, each one becomes 2 or more UTF-8 bytes (C3 B0, C5 B8, E2 84 A2, E2 80 9A), which is why the resulting byte sequence is 10 bytes long. The value of this byte sequence is not significant; it's just the result of using an incorrect encoding.
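The chain of conversions can be replayed step by step. This is a sketch with invented names (MojibakeDemo, misreadAsCp1252, toHex), assuming the windows-1252 charset is available in the JRE:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: upper-case hex dump of a byte array.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes UTF-8 bytes with the wrong charset, as javac did on the Windows machine.
    static String misreadAsCp1252(byte[] utf8Bytes) {
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Step 1: the bytes stored in the UTF-8 source file.
        byte[] onDisk = "\uD83D\uDE42".getBytes(StandardCharsets.UTF_8);
        System.out.println("on disk:    " + toHex(onDisk));

        // Step 2: each byte is misread as one Cp1252 character.
        String misread = misreadAsCp1252(onDisk);
        StringBuilder codePoints = new StringBuilder();
        for (int i = 0; i < misread.length(); i++) {
            codePoints.append(String.format("U+%04X ", (int) misread.charAt(i)));
        }
        System.out.println("misread as: " + codePoints.toString().trim());

        // Step 3: the misread characters are encoded as UTF-8.
        System.out.println("re-encoded: " + toHex(misread.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The last line should print C3B0C5B8E284A2E2809A, the sequence from the question.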

answered Nov 10 '22 16:11

5 revs, 3 users 69%

You didn't take into account that getBytes() returns the bytes in the platform's default encoding, which differs between Windows and CentOS.

See also How to Find the Default Charset/Encoding in Java? and the API documentation on String.getBytes().
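To check which default is in effect on a given machine, something along these lines can be run on both platforms. It is a minimal sketch; the class and method names are invented for illustration, and the printed values depend on the JVM and its settings (on Java 18 and later the default is UTF-8 unless overridden).

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultCharsetDemo {
    // True if the no-argument getBytes() matches an explicit call with the default charset.
    static boolean usesDefaultCharset(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // The charset that getBytes() without arguments falls back to.
        System.out.println("Default charset: " + Charset.defaultCharset());
        // The system property the default is usually derived from.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println("getBytes() uses the default: " + usesDefaultCharset("\uD83D\uDE42"));
    }
}
```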

answered Nov 10 '22 16:11

Björn Zurmaar
