
Substring or characterAt method for UTF-8 strings with 2+ byte characters in Java

I'm trying to find a substring or characterAt method that works on strings containing UTF-8 encoded text in Java.

Internally, Java works with UTF-16. This means that a String is composed of chars with a size of 2 bytes each. A UTF-8 character can be up to 4 bytes in size. When Java decodes a 4-byte UTF-8 character into a String, it splits the character over two chars (a surrogate pair).

For example: the character U+20000 (UTF-8 hex: F0 A0 80 80) is stored internally in Java as a String with two chars (UTF-16 hex: D840 and DC00).
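
A quick way to check those two values is sketched below. The class and method names (SurrogatePairDemo, utf16Units) are made up for illustration; only Character.toChars comes from the standard library.

```java
class SurrogatePairDemo {
    // Returns the UTF-16 code units Java uses for one code point, as hex strings.
    // (Hypothetical helper, named here only for this example.)
    static String[] utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        String[] hex = new String[units.length];
        for (int i = 0; i < units.length; i++) {
            hex[i] = String.format("%04X", (int) units[i]);
        }
        return hex;
    }

    public static void main(String[] args) {
        // U+20000 lies outside the Basic Multilingual Plane, so it needs two chars.
        System.out.println(String.join(" ", utf16Units(0x20000))); // prints "D840 DC00"
    }
}
```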

When a String contains a single 4-byte UTF-8 character, length() returns 2, and substring(0,1) returns only the first half of the character.

Some code to illustrate this:

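    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.StandardCharsets;

    // U+20000 encoded as UTF-8: F0 A0 80 80
    ByteBuffer inputBuffer = ByteBuffer.wrap(new byte[]{(byte)0xF0, (byte)0xA0, (byte)0x80, (byte)0x80});
    CharBuffer data = StandardCharsets.UTF_8.decode(inputBuffer);
    String string_test = data.toString();
    int length = string_test.length();                      // 2: counts UTF-16 code units
    String first_half = string_test.substring(0, 1);        // high surrogate (D840) only
    String second_half = string_test.substring(1, 2);       // low surrogate (DC00) only
    String full_character = string_test.substring(0, 2);    // the complete character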

All this, even if unexpected, is not a bug, since Java works in UTF-16. Inherent UTF-8 support would be nice, but it's not there.

Does Java have any class in the default library, or does a class exist somewhere, that provides UTF-8 support? As in:

  • utf8string.length() - returns 1 if there is one 4 byte character in there
  • utf8string.getCharacterAt(0) - returns the first character, not the first half of it.
  • utf8string.substring(0,1) - returns the first character, not the first half of it.

Or, what is the commonly used solution for this? Convert all UTF-8 characters that don't fit in a single UTF-16 char to a default character when reading UTF-8 files, and, as a result, lose all information on characters in that code point range? That is not necessarily an issue in my specific implementation, so if there is a common way of doing this, I'd be interested.

Wouter asked Jul 08 '13 10:07


1 Answer

Does Java have any class in the default library, or does a class exist somewhere, that provides UTF-8 support?

You're not after UTF-8 support really. You're after Unicode code points (plain 32-bit integers), rather than UTF-16 code units. And yes, Java provides support for this, but it's not hugely easy to work with.

For example, to get a particular code point, use String.codePointAt - bearing in mind that the index you provide is in terms of UTF-16 code units, not code points.
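
As a rough sketch of that pitfall (the class name CodePointAtDemo and the sample string are just for illustration), note how the result depends on which UTF-16 code unit the index lands on:

```java
class CodePointAtDemo {
    // Sample text: 'a', then U+20000 (two chars), then 'b'. Four chars in total.
    static String sample() {
        return "a" + new String(Character.toChars(0x20000)) + "b";
    }

    public static void main(String[] args) {
        String s = sample();
        // Index 1 is the high surrogate, so the full code point is returned.
        System.out.println(Integer.toHexString(s.codePointAt(1))); // prints "20000"
        // Index 2 is the low surrogate on its own, so only that char value comes back.
        System.out.println(Integer.toHexString(s.codePointAt(2))); // prints "dc00"
        // Index 3 (not 2) is where 'b' lives.
        System.out.println((char) s.codePointAt(3));               // prints "b"
    }
}
```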

To find the length in code points, use String.codePointCount.
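
A minimal sketch, assuming you want the count for the whole string (the helper name lengthInCodePoints is hypothetical; String.codePointCount takes a begin and end index in UTF-16 code units):

```java
class CodePointCountDemo {
    // Number of Unicode code points in the whole string.
    // (Hypothetical helper wrapping String.codePointCount.)
    static int lengthInCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String single = new String(Character.toChars(0x20000));
        System.out.println(single.length());            // prints "2" (UTF-16 code units)
        System.out.println(lengthInCodePoints(single)); // prints "1" (code points)
    }
}
```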

To find a substring, you need to find the offset in terms of UTF-16 code units, then use the normal substring method; use String.offsetByCodePoints to find the right index.
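
Something along these lines should work as a code-point-based substring. This is a sketch, not a library method: substringByCodePoints is a name invented here, and it assumes begin and end are valid code point positions (otherwise offsetByCodePoints throws IndexOutOfBoundsException).

```java
class CodePointSubstringDemo {
    // Substring where begin/end are counted in code points, not chars.
    // (Hypothetical helper built on String.offsetByCodePoints.)
    static String substringByCodePoints(String s, int begin, int end) {
        int from = s.offsetByCodePoints(0, begin);        // char index of the first code point
        int to = s.offsetByCodePoints(from, end - begin); // char index just past the last one
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x20000));
        String s = "a" + supplementary + "b";
        String middle = substringByCodePoints(s, 1, 2);
        System.out.println(middle.equals(supplementary)); // prints "true"
        System.out.println(middle.length());              // prints "2" (both surrogates kept)
    }
}
```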

Basically look through the String API at all the methods which contain codePoint.
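
For instance, walking a string one code point at a time can be sketched with codePointAt plus Character.charCount. The class and method names below are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

class CodePointIterationDemo {
    // Collects every code point in the string, stepping over surrogate pairs as a unit.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // advance by 1 or 2 chars, depending on the code point
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x20000)) + "b";
        for (int cp : codePointsOf(s)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
        // prints "U+61", "U+20000", "U+62" on separate lines
    }
}
```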

Jon Skeet answered Nov 01 '22 16:11