Calculating length in UTF-8 of Java String without actually encoding it

Tags: java, utf-8

Does anyone know if the standard Java library (any version) provides a means of calculating the length of the binary encoding of a string (specifically UTF-8 in this case) without actually generating the encoded output? In other words, I'm looking for an efficient equivalent of this:

"some really long string".getBytes("UTF-8").length

I need to calculate a length prefix for potentially long serialized messages.


asked Dec 14 '11 20:12

Trevor Robinson

2 Answers

Here's an implementation based on the UTF-8 specification:

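public class Utf8LenCounter {
  /** Returns the number of bytes needed to encode the sequence as UTF-8. */
  public static int length(CharSequence sequence) {
    int count = 0;
    for (int i = 0, len = sequence.length(); i < len; i++) {
      char ch = sequence.charAt(i);
      if (ch <= 0x7F) {
        count++;       // U+0000..U+007F: 1 byte
      } else if (ch <= 0x7FF) {
        count += 2;    // U+0080..U+07FF: 2 bytes
      } else if (Character.isHighSurrogate(ch)) {
        count += 4;    // surrogate pair (supplementary code point): 4 bytes
        ++i;           // skip the low surrogate, assumed to follow
      } else {
        count += 3;    // rest of the Basic Multilingual Plane: 3 bytes
      }
    }
    return count;
  }
}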

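This implementation is not tolerant of malformed strings: it assumes every high surrogate is followed by a low surrogate. A lone high surrogate is counted as 4 bytes and the char after it is skipped, and a lone low surrogate is counted as 3 bytes, whereas String.getBytes replaces each unpaired surrogate with a single '?' byte.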
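If malformed input is possible, the counter above can be adapted to agree with getBytes. The following is only a sketch under one assumption: that the length should match String.getBytes(StandardCharsets.UTF_8), which substitutes one '?' byte for each unpaired surrogate. The class name LenientUtf8Length is made up for this illustration and is not part of any library.

```java
import java.nio.charset.StandardCharsets;

public class LenientUtf8Length {
    /**
     * Counts the UTF-8 bytes of the sequence, counting each unpaired
     * surrogate as one byte (assumption: it will be encoded as '?').
     */
    public static int length(CharSequence sequence) {
        int count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                count++;
            } else if (ch <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(ch)
                    && i + 1 < len
                    && Character.isLowSurrogate(sequence.charAt(i + 1))) {
                count += 4; // well-formed surrogate pair
                i++;        // consume the low surrogate
            } else if (Character.isSurrogate(ch)) {
                count++;    // unpaired surrogate, replaced by one '?' byte
            } else {
                count += 3;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Compare against the real encoder, including a lone high surrogate.
        String[] samples = { "abc", "\u00E9", "\u20AC", "a" + (char) 0xD83D + "b" };
        for (String s : samples) {
            int encoded = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(length(s) + " counted, " + encoded + " encoded");
        }
    }
}
```

If the encoder in use reports malformed input as an error instead of replacing it, throwing an exception from the unpaired-surrogate branch would be the closer match.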

Here's a JUnit 4 test for verification:

import java.nio.charset.Charset;

import org.junit.Assert;
import org.junit.Test;

public class LenCounterTest {
  // Compares the counter with String.getBytes for every defined code point.
  @Test
  public void testUtf8Len() {
    Charset utf8 = Charset.forName("UTF-8");
    AllCodepointsIterator iterator = new AllCodepointsIterator();
    while (iterator.hasNext()) {
      String test = new String(Character.toChars(iterator.next()));
      Assert.assertEquals(test.getBytes(utf8).length,
                          Utf8LenCounter.length(test));
    }
  }

  // Walks the code points from 0 upward, skipping the surrogate range
  // and undefined code points.
  private static class AllCodepointsIterator {
    private static final int MAX = 0x10FFFF; // see http://unicode.org/glossary/
    private static final int SURROGATE_FIRST = 0xD800;
    private static final int SURROGATE_LAST = 0xDFFF;
    private int codepoint = 0;

    public boolean hasNext() {
      return codepoint < MAX;
    }

    public int next() {
      int ret = codepoint;
      codepoint = next(codepoint);
      return ret;
    }

    private int next(int codepoint) {
      while (codepoint++ < MAX) {
        if (codepoint == SURROGATE_FIRST) {
          codepoint = SURROGATE_LAST + 1; // jump past the surrogate block
        }
        if (!Character.isDefined(codepoint)) {
          continue;
        }
        return codepoint;
      }
      return MAX;
    }
  }
}


answered Sep 18 '22 15:09

McDowell

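Using Guava's Utf8 class (com.google.common.base.Utf8), which returns the encoded length without building the byte array: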

Utf8.encodedLength("some really long string")

answered Sep 19 '22 15:09

Aaron Feldman
