
Character index to and from byte index

I know how to convert a character string into a byte array using a particular encoding, but how do I convert the character indexes to byte indexes (in Java)?

For instance, in UTF-32, character index i is byte index 4 * i because every UTF-32 character is 4 bytes wide. But in UTF-8, most English characters are 1 byte wide, characters in most other scripts are 2 or 3 bytes wide, and a few are 4 bytes wide. For a given string and encoding, how would I get an array of starting byte indexes for each character?

Here's an example of what I mean. The string "Hello مرحبا こんにちは" in UTF-8 has the following indexes: [0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 17, 20, 23, 26, 29] because the Latin characters are 1 byte each, the Arabic characters are 2 bytes each, and the Japanese characters are 3 bytes each. (Before the cumulative sum, the array is [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 3, 3, 3, 3, 3].)

Is there a library function in Java that computes these index positions? It needs to be efficient, so I shouldn't convert each character to a separate byte array just to query its length. Is there an easy way to compute it myself, from some knowledge of Unicode? It should be possible to do in one pass, by recognizing special bytes that indicate the width of the next character.

Jim Pivarski asked Feb 11 '23 07:02


1 Answer

I think this can do what you want:

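static int[] utf8ByteIndexes(String s) {
    // One entry per Java char (UTF-16 code unit), not per code point.
    int[] byteIndexes = new int[s.length()];
    int sum = 0; // running UTF-8 byte offset
    for (int i = 0; i < s.length(); i++) {
        byteIndexes[i] = sum;
        int c = s.codePointAt(i);
        if (Character.charCount(c) == 2) {
            // Surrogate pair: both chars map to the same starting byte.
            i++;
            byteIndexes[i] = sum;
        }
        // UTF-8 width is determined by the code point's range.
        if (c <=     0x7F) sum += 1; else
        if (c <=    0x7FF) sum += 2; else
        if (c <=   0xFFFF) sum += 3; else
        if (c <= 0x1FFFFF) sum += 4; else
        throw new AssertionError("code point out of range: " + c);
    }
    return byteIndexes;
}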

Given a Java String, it returns an array holding the UTF-8 byte index of each char (UTF-16 code unit) in the String.

System.out.println(Arrays.toString(utf8ByteIndexes("Hello مرحبا こんにちは")));

Output:

[0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 17, 20, 23, 26, 29]
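One way to cross-check these offsets is the one-pass idea from the question: encode the string once and record the position of every byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx). The following is only a sketch, and it assumes the input is already well-formed UTF-8. The class and method names are made up for illustration. Note that it yields one offset per code point rather than per Java char, so it matches the array above only for strings without characters above U+FFFF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8LeadByteOffsets {
    // Hypothetical helper: returns the offset of the first byte of each
    // encoded character (one entry per code point, not per Java char).
    // Assumes the array holds well-formed UTF-8.
    static int[] leadByteOffsets(byte[] utf8) {
        int[] offsets = new int[utf8.length];
        int n = 0;
        for (int i = 0; i < utf8.length; i++) {
            // Continuation bytes look like 10xxxxxx; any other byte starts a character.
            if ((utf8[i] & 0xC0) != 0x80) {
                offsets[n++] = i;
            }
        }
        return Arrays.copyOf(offsets, n);
    }

    public static void main(String[] args) {
        // The example string from the question, written with Unicode escapes:
        // "Hello", a space, five Arabic letters, a space, five hiragana.
        String s = "Hello \u0645\u0631\u062D\u0628\u0627 \u3053\u3093\u306B\u3061\u306F";
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(leadByteOffsets(bytes)));
    }
}
```

For the example string this prints the same 17 offsets as above. It costs one full encoding of the string, so it is more useful as a check, or when the bytes already exist, than as a replacement for the function above.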

Exotic Unicode characters above U+FFFF, those that don't fit in Java's 16-bit char type, are a bit of a nuisance. For example, the Christmas tree emoji U+1F384 (🎄) is encoded using two Java "chars" (a surrogate pair). For those, the function above returns the same byte index for both chars:

System.out.println(Arrays.toString(utf8ByteIndexes("x🎄y")));

Output:

[0, 1, 1, 5]

The overall cumulative byte count is correct though (the emoji takes 4 bytes if encoded in UTF-8).
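If one index per code point is more convenient than one per char, a variant along these lines might work. It is a sketch, not a tested library routine: the class and method names are invented here, and it assumes the String is well-formed UTF-16. An unpaired surrogate would be counted as 3 bytes, which is not what String.getBytes produces for it.

```java
import java.util.Arrays;

class Utf8CodePointIndexes {
    // Hypothetical variant: one entry per code point, so a surrogate pair
    // contributes a single index. Assumes well-formed UTF-16 input.
    static int[] utf8CodePointByteIndexes(String s) {
        int[] byteIndexes = new int[s.codePointCount(0, s.length())];
        int sum = 0; // running UTF-8 byte offset
        int n = 0;   // next slot in byteIndexes
        for (int i = 0; i < s.length(); ) {
            int c = s.codePointAt(i);
            byteIndexes[n++] = sum;
            if (c <= 0x7F) sum += 1;
            else if (c <= 0x7FF) sum += 2;
            else if (c <= 0xFFFF) sum += 3;
            else sum += 4;
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(c);
        }
        return byteIndexes;
    }

    public static void main(String[] args) {
        // "x", U+1F384 (Christmas tree) written as a surrogate pair, "y"
        System.out.println(Arrays.toString(utf8CodePointByteIndexes("x\uD83C\uDF84y")));
    }
}
```

For "x🎄y" this prints [0, 1, 5]. The trade-off is that the array can no longer be indexed with String char positions, so which variant fits depends on how the caller addresses characters.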

Boann answered Feb 13 '23 21:02