I know how to convert a character string into a byte array using a particular encoding, but how do I convert the character indexes to byte indexes (in Java)?
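(The whole-string conversion I already know about is just the standard getBytes call; for example:)
byte[] utf8 = "Hello مرحبا こんにちは".getBytes(java.nio.charset.StandardCharsets.UTF_8);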
For instance, in UTF-32, character index i is byte index 4 * i, because every UTF-32 character is 4 bytes wide. But in UTF-8, most English characters are 1 byte wide, characters in most other scripts are 2 or 3 bytes wide, and a few are 4 bytes wide. For a given string and encoding, how would I get an array of starting byte indexes for each character?
Here's an example of what I mean. The string "Hello مرحبا こんにちは" in UTF-8 has the following indexes: [0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 17, 20, 23, 26, 29], because the Latin characters are 1 byte each, the Arabic characters are 2 bytes each, and the Japanese characters are 3 bytes each. (Before taking the cumulative sum, the array of per-character byte widths is [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 3, 3, 3, 3, 3].)
Is there a library function in Java that computes these index positions? It needs to be efficient, so I shouldn't convert each character to a separate byte array just to query its length. Is there an easy way to compute it myself, from some knowledge of Unicode? It should be possible to do in one pass, by recognizing special bytes that indicate the width of the next character.
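For reference, the kind of inefficient approach I mean (and want to avoid) would be something like this hypothetical sketch, which allocates a separate String and byte array for every character just to measure its width:
import java.nio.charset.StandardCharsets;

static int[] utf8ByteIndexesNaive(String s) {
    int[] byteIndexes = new int[s.length()];
    int sum = 0;
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        // One throwaway String and byte[] per character, just to get its UTF-8 length.
        int width = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length;
        for (int j = 0; j < Character.charCount(cp); j++) {
            byteIndexes[i + j] = sum;  // a surrogate pair gets the same starting index twice
        }
        sum += width;
        i += Character.charCount(cp);
    }
    return byteIndexes;
}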
I think this can do what you want:
static int[] utf8ByteIndexes(String s) {
    int[] byteIndexes = new int[s.length()];
    int sum = 0;
    for (int i = 0; i < s.length(); i++) {
        byteIndexes[i] = sum;
        int c = s.codePointAt(i);
        if (Character.charCount(c) == 2) {
            // Code point above U+FFFF: it occupies two chars (a surrogate pair),
            // so record the same starting byte index for the second char too.
            i++;
            byteIndexes[i] = sum;
        }
        // Add the UTF-8 width of this code point.
        if (c <= 0x7F) sum += 1; else
        if (c <= 0x7FF) sum += 2; else
        if (c <= 0xFFFF) sum += 3; else
        if (c <= 0x1FFFFF) sum += 4; else
        throw new Error();  // not a valid code point
    }
    return byteIndexes;
}
Given a Java string, it returns an array of the UTF-8 byte indexes corresponding to each char in the String.
System.out.println(Arrays.toString(utf8ByteIndexes("Hello مرحبا こんにちは")));
Output:
[0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 17, 20, 23, 26, 29]
Exotic Unicode characters above U+FFFF, those that don't fit in Java's 16-bit char type, are a bit of a nuisance. For example, the Christmas tree emoji U+1F384 (🎄) is encoded using two Java "chars". For those, the function above returns the same byte index for both chars:
System.out.println(Arrays.toString(utf8ByteIndexes("x🎄y")));
Output:
[0, 1, 1, 5]
The overall cumulative byte count is correct though (the emoji takes 4 bytes if encoded in UTF-8).
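If you would rather get one entry per Unicode code point instead of one per Java char (so the emoji maps to a single index), a small variation along these lines should work; this is my own sketch, not part of the function above:
import java.util.ArrayList;
import java.util.List;

static List<Integer> utf8ByteIndexesByCodePoint(String s) {
    List<Integer> byteIndexes = new ArrayList<>();
    int sum = 0;
    for (int i = 0; i < s.length(); ) {
        int c = s.codePointAt(i);
        byteIndexes.add(sum);
        // Same UTF-8 width rules as above.
        if (c <= 0x7F) sum += 1;
        else if (c <= 0x7FF) sum += 2;
        else if (c <= 0xFFFF) sum += 3;
        else sum += 4;
        i += Character.charCount(c);  // advance by one code point, not one char
    }
    return byteIndexes;
}
For "x🎄y" that gives [0, 1, 5] instead of [0, 1, 1, 5].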