What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding is only known at runtime. In UTF-8, for example, characters have a variable byte length, so the length needs to be determined for each character individually. So far I've come up with this:
char c = getCharSomehow();
String encoding = getEncodingSomehow();
// ...
int length = new String(new char[] { c }).getBytes(encoding).length;
But this is clumsy and inefficient in a loop, since a new String needs to be created every time. I can't find a more efficient way in the Java API. There's String#valueOf(char), but according to its source it does basically the same as above. I imagine that this can be done with bitwise operations like bit shifting, but that's my weak point and I'm unsure how to take the encoding into account here :)
If you question the need for this, check this topic.
Update: the answer from @Bkkbrad is technically the most efficient:
char c = getCharSomehow();
String encoding = getEncodingSomehow();
CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
// ...
int length = encoder.encode(CharBuffer.wrap(new char[] { c })).limit();
However, as @Stephen C pointed out, there are more problems with this. There may, for example, be combined/surrogate characters, which need to be taken into account as well. But that's a separate problem, which needs to be solved in the step before this one.
UTF-16 is based on 16-bit code units. Each character is encoded as at least 2 bytes.
UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.
UTF-8 is a variable-width character encoding standard that uses between one and four eight-bit bytes to represent all valid Unicode code points.
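If the encoding turns out to be UTF-8 or UTF-16 at runtime, the length can be worked out from the code point value alone, without encoding anything. Here is a minimal sketch of that idea. The class name `EncodedLengths` and its method names are made up for illustration (they are not part of the Java API), it works on code points rather than on single `char` values, and it does not help for any other encoding:

```java
final class EncodedLengths {

    /** Number of bytes UTF-8 uses for one Unicode code point (1 to 4). */
    static int utf8Length(int codePoint) {
        requireScalarValue(codePoint);
        if (codePoint < 0x80) return 1;      // U+0000..U+007F (the first 128 code points)
        if (codePoint < 0x800) return 2;     // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;   // U+0800..U+FFFF
        return 4;                            // U+10000..U+10FFFF
    }

    /** Number of bytes UTF-16 uses for one code point (2 or 4), not counting any byte order mark. */
    static int utf16Length(int codePoint) {
        requireScalarValue(codePoint);
        // One 16-bit code unit for the BMP, two (a surrogate pair) above it.
        return 2 * Character.charCount(codePoint);
    }

    // Surrogate code points cannot be encoded on their own, so they are rejected here.
    private static void requireScalarValue(int codePoint) {
        boolean surrogate = codePoint >= Character.MIN_SURROGATE
                && codePoint <= Character.MAX_SURROGATE;
        if (!Character.isValidCodePoint(codePoint) || surrogate) {
            throw new IllegalArgumentException(
                    "Not an encodable code point: 0x" + Integer.toHexString(codePoint));
        }
    }
}
```

For example, `EncodedLengths.utf8Length(0x20AC)` (the euro sign) gives 3, while `EncodedLengths.utf16Length(0x20AC)` gives 2.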
Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.
On my system, the following code takes 25 seconds to encode 100 million single characters (10,000 repetitions of 10,000 characters):
Charset utf8 = Charset.forName("UTF-8");
char[] array = new char[1];
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        int len = new String(array).getBytes(utf8).length;
    }
}
However, the following code does the same thing in under 4 seconds:
Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
char[] array = new char[1];
CharBuffer input = CharBuffer.wrap(array);
ByteBuffer output = ByteBuffer.allocate(10);
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        output.clear();
        input.clear();
        encoder.encode(input, output, false);
        int len = output.position();
    }
}
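Since the question only knows the encoding name at runtime, the same buffer-reuse idea can be wrapped in a small helper. This is only a sketch: the class name `CharByteLength` and the method `lengthOf` are invented for illustration. Unlike the benchmark loop, it resets the encoder and passes `true` for end-of-input on every call, so a lone surrogate or an unmappable character is reported instead of silently giving 0. One consequence of the reset is that an encoding that writes a prefix, such as plain "UTF-16" with its byte order mark, will include that prefix in every length; "UTF-16BE" or "UTF-16LE" avoid that.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

final class CharByteLength {
    private final CharsetEncoder encoder;
    private final char[] array = new char[1];
    private final CharBuffer input = CharBuffer.wrap(array); // shares storage with array
    private final ByteBuffer output;

    CharByteLength(String encodingName) {
        this.encoder = Charset.forName(encodingName).newEncoder();
        // Room for one encoded char plus some slack for prefixes or escape sequences.
        this.output = ByteBuffer.allocate((int) Math.ceil(encoder.maxBytesPerChar()) + 8);
    }

    /** Encoded length of a single char in bytes; throws if the char cannot be encoded alone. */
    int lengthOf(char c) {
        array[0] = c;
        input.clear();
        output.clear();
        encoder.reset();
        CoderResult result = encoder.encode(input, output, true);
        if (result.isError()) {
            throw new IllegalArgumentException("Cannot encode U+"
                    + Integer.toHexString(c) + " in " + encoder.charset().name()
                    + ": " + result);
        }
        encoder.flush(output); // lets stateful encoders write any trailing bytes
        return output.position();
    }
}
```

Create one instance per encoding (for example `new CharByteLength(getEncodingSomehow())`) and call `lengthOf` inside the loop; the instance holds mutable buffers, so it should not be shared between threads.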
Edit: Why do haters gotta hate?
Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:
Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
CharBuffer input = //allocate in some way, or pass as parameter
ByteBuffer output = ByteBuffer.allocate(10);
int limit = input.limit();
while (input.position() < limit) {
    output.clear();
    input.mark();
    // Look at no more than two chars, and never past the original limit
    input.limit(Math.min(input.position() + 2, limit));
    if (Character.isHighSurrogate(input.get())
            && !(input.hasRemaining() && Character.isLowSurrogate(input.get()))) {
        //Malformed surrogate pair; do something!
        //(the encoder will not consume it, so skip it or bail out here)
    }
    input.limit(input.position());
    input.reset();
    encoder.encode(input, output, false);
    int encodedLen = output.position();
}
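An alternative way to deal with surrogate pairs is to walk the text by code point and encode each one separately. The following is a rough sketch of that approach, not a drop-in replacement for the loop above: the class name `CodePointByteLengths`, the method `of`, and the choice of -1 as the marker for input that cannot be encoded (an unpaired surrogate, or a character the charset cannot map) are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

final class CodePointByteLengths {

    /**
     * Returns one entry per code point in text: its encoded length in bytes,
     * or -1 if it is an unpaired surrogate or cannot be mapped by the charset.
     */
    static int[] of(CharSequence text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        char[] chars = new char[2];                 // one code point is at most two chars
        CharBuffer input = CharBuffer.wrap(chars);
        ByteBuffer output = ByteBuffer.allocate(
                2 * (int) Math.ceil(encoder.maxBytesPerChar()) + 8);
        int[] lengths = new int[Character.codePointCount(text, 0, text.length())];
        int n = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = Character.codePointAt(text, i);
            int count = Character.toChars(codePoint, chars, 0); // 1 or 2 chars written
            input.clear();
            input.limit(count);
            output.clear();
            encoder.reset();
            CoderResult result = encoder.encode(input, output, true);
            if (result.isError()) {
                lengths[n++] = -1;                  // malformed or unmappable
            } else {
                encoder.flush(output);
                lengths[n++] = output.position();
            }
            i += count;                             // always advances, even on errors
        }
        return lengths;
    }
}
```

Because the loop index advances by the char count of the code point rather than by the encoder's position, malformed input cannot stall it. As with any per-character reset, encodings that emit a prefix such as a byte order mark will count it for every code point.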