Efficient way to calculate byte length of a character, depending on the encoding

Tags: java, character-encoding, character, byte

What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding is known only at runtime. In UTF-8, for example, characters have a variable byte length, so the length has to be determined for each character individually. So far I've come up with this:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
// ...
int length = new String(new char[] { c }).getBytes(encoding).length;

But this is clumsy and inefficient in a loop, since a new String has to be created every time. I can't find any other, more efficient way in the Java API. There's String#valueOf(char), but according to its source it does basically the same as above. I imagine that this can be done with bitwise operations like bit shifting, but that's my weak point and I'm unsure how to take the encoding into account here :)
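For the special case where the runtime encoding turns out to be UTF-8, the byte length follows directly from the code point's numeric range, so no encoder is needed at all. The following is a minimal sketch of that idea, not a general answer to the question: the class name Utf8Length and its methods are made up for illustration, and other encodings still need a CharsetEncoder.

```java
// Sketch only: Utf8Length and its method names are hypothetical, not part of the Java API.
final class Utf8Length {
    private Utf8Length() {}

    /** Number of bytes UTF-8 uses for the given Unicode code point. */
    static int ofCodePoint(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a code point: " + codePoint);
        }
        if (codePoint < 0x80) return 1;     // 7 bits: ASCII
        if (codePoint < 0x800) return 2;    // up to 11 bits
        if (codePoint < 0x10000) return 3;  // rest of the Basic Multilingual Plane
        return 4;                           // supplementary planes (surrogate pairs in Java)
    }

    /** Total UTF-8 length of a text, walking code points rather than chars. */
    static int ofString(CharSequence text) {
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            // Note: an unpaired surrogate is counted as 3 bytes here, whereas
            // String.getBytes would substitute a replacement byte for it.
            total += ofCodePoint(cp);
            i += Character.charCount(cp);
        }
        return total;
    }
}
```

For well-formed text, Utf8Length.ofString(s) should agree with s.getBytes(StandardCharsets.UTF_8).length without allocating a byte array.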

(If you question the need for this, check this topic.)

Update: the answer from @Bkkbrad is technically the most efficient:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
// ...
int length = encoder.encode(CharBuffer.wrap(new char[] { c })).limit();

However, as @Stephen C pointed out, there are more problems with this. For example, there may be combining characters or surrogate pairs, which need to be taken into account as well. But that's a separate problem, which needs to be solved in the step before this one.


asked Apr 28 '10 00:04

BalusC

1 Answer

Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.

On my system, the following code takes 25 seconds to encode 100,000,000 single characters (10,000 passes over 10,000 characters):

Charset utf8 = Charset.forName("UTF-8");
char[] array = new char[1];
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        int len = new String(array).getBytes(utf8).length;
    }
}

However, the following code does the same thing in under 4 seconds:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
char[] array = new char[1];
CharBuffer input = CharBuffer.wrap(array);
ByteBuffer output = ByteBuffer.allocate(10);
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        output.clear();
        input.clear();
        encoder.encode(input, output, false);
        int len = output.position();
    }
}
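Since the question says the encoding name is only known at runtime, the reuse pattern above could be wrapped in a small helper that is created once per encoding. This is a sketch under assumptions: the class name CharByteLength, the method lengthOf, and the choice to return -1 for a char that cannot be encoded on its own are all illustrative and not part of the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper: names and the -1 convention are illustrative only.
final class CharByteLength {
    private final CharsetEncoder encoder;
    private final char[] array = new char[1];
    private final CharBuffer input = CharBuffer.wrap(array);
    private final ByteBuffer output;

    CharByteLength(String encodingName) {
        // Charset.forName throws an unchecked exception for unknown names.
        encoder = Charset.forName(encodingName).newEncoder();
        // maxBytesPerChar() is the encoder's own worst-case bound for one char.
        output = ByteBuffer.allocate((int) Math.ceil(encoder.maxBytesPerChar()));
    }

    /** Byte length of c in this encoding, or -1 if c cannot be encoded by itself. */
    int lengthOf(char c) {
        array[0] = c;
        input.clear();
        output.clear();
        encoder.reset();
        CoderResult result = encoder.encode(input, output, true);
        // Unmappable chars and lone surrogates either report an error
        // or leave the char unconsumed; both cases yield -1 here.
        if (!result.isUnderflow() || input.hasRemaining()) {
            return -1;
        }
        return output.position();
    }
}
```

An instance is not thread-safe because it shares its buffers and encoder across calls, so each thread would need its own. Encodings that write a byte-order mark or shift sequences may report extra bytes per call; this sketch does not try to separate those out.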

Edit: Why do haters gotta hate?

Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
CharBuffer input = //allocate in some way, or pass as parameter
ByteBuffer output = ByteBuffer.allocate(10);

int limit = input.limit();
while(input.position() < limit) {
    output.clear();
    input.mark();
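    // Expose at most two chars (one possible surrogate pair), never past the original limit.
    input.limit(Math.min(input.position() + 2, limit));
    if (Character.isHighSurrogate(input.get()) && !(input.hasRemaining() && Character.isLowSurrogate(input.get()))) {
        //Malformed surrogate pair; do something! (e.g. throw or skip it, since encode() below will not consume it)
    }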
    input.limit(input.position());
    input.reset();
    encoder.encode(input, output, false);
    int encodedLen = output.position();
}
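An alternative way to handle surrogate pairs is to iterate over code points with the Character API and hand each code point's one or two chars to a reused encoder. The sketch below assumes the text is available as a CharSequence rather than a CharBuffer; the class name CodePointByteLengths and the -1 marker for unencodable code points are illustrative choices, not something from the answer above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper: names and the -1 convention are illustrative only.
final class CodePointByteLengths {
    private CodePointByteLengths() {}

    /** One byte length per code point of text; -1 where the code point cannot be encoded. */
    static int[] of(CharSequence text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        char[] pair = new char[2];                  // room for one surrogate pair
        CharBuffer input = CharBuffer.wrap(pair);
        ByteBuffer output = ByteBuffer.allocate((int) Math.ceil(2 * encoder.maxBytesPerChar()));
        int[] lengths = new int[Character.codePointCount(text, 0, text.length())];
        int n = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            int charCount = Character.toChars(cp, pair, 0); // 1 or 2 chars written
            input.clear();
            input.limit(charCount);
            output.clear();
            encoder.reset();
            CoderResult result = encoder.encode(input, output, true);
            // Errors (unmappable, unpaired surrogate) or unconsumed input yield -1.
            lengths[n++] = (result.isUnderflow() && !input.hasRemaining())
                    ? output.position() : -1;
            i += charCount;
        }
        return lengths;
    }
}
```

This trades the manual mark/limit bookkeeping for Character.codePointAt, at the cost of copying each code point into a two-char scratch array.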


answered Sep 20 '22 22:09

Bkkbrad
