
Calculating length in UTF-8 of Java String without actually encoding it

Tags:

java

utf-8

Does anyone know if the standard Java library (any version) provides a means of calculating the length of the binary encoding of a string (specifically UTF-8 in this case) without actually generating the encoded output? In other words, I'm looking for an efficient equivalent of this:

"some really long string".getBytes("UTF-8").length 

I need to calculate a length prefix for potentially long serialized messages.

Trevor Robinson asked Dec 14 '11 20:12



2 Answers

Here's an implementation based on the UTF-8 specification:

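public class Utf8LenCounter {
  public static int length(CharSequence sequence) {
    int count = 0;
    for (int i = 0, len = sequence.length(); i < len; i++) {
      char ch = sequence.charAt(i);
      if (ch <= 0x7F) {
        // U+0000..U+007F: one byte
        count++;
      } else if (ch <= 0x7FF) {
        // U+0080..U+07FF: two bytes
        count += 2;
      } else if (Character.isHighSurrogate(ch)) {
        // Surrogate pair (U+10000..U+10FFFF): four bytes; skip the low surrogate
        count += 4;
        ++i;
      } else {
        // Rest of the Basic Multilingual Plane: three bytes
        count += 3;
      }
    }
    return count;
  }
}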

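This implementation is not tolerant of malformed strings: it assumes every high surrogate is followed by a low surrogate, so for a string containing unpaired surrogates the result will not match what getBytes returns.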
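If malformed input matters, a tolerant variant might look like the sketch below. It assumes you want the count to match String.getBytes, which replaces each unpaired surrogate with a single '?' byte. The class name TolerantUtf8LenCounter is hypothetical, and the method returns a long as a precaution against overflow on very long inputs.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical variant of Utf8LenCounter; names are illustrative only.
final class TolerantUtf8LenCounter {

  // Counts UTF-8 bytes without encoding. Each unpaired surrogate counts as
  // one byte, on the assumption that the encoder substitutes '?' for it
  // (which is what String.getBytes does).
  static long length(CharSequence sequence) {
    long count = 0;
    for (int i = 0, len = sequence.length(); i < len; i++) {
      char ch = sequence.charAt(i);
      if (ch <= 0x7F) {
        count++;
      } else if (ch <= 0x7FF) {
        count += 2;
      } else if (Character.isHighSurrogate(ch)
          && i + 1 < len
          && Character.isLowSurrogate(sequence.charAt(i + 1))) {
        // Well-formed surrogate pair: four bytes, consume both chars.
        count += 4;
        i++;
      } else if (Character.isSurrogate(ch)) {
        // Unpaired high or low surrogate: replaced by a single '?' byte.
        count++;
      } else {
        count += 3;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // ASCII, 2-byte, 3-byte, 4-byte, and two malformed samples.
    String[] samples = {
      "abc", "\u00E9", "\u20AC", "\uD83D\uDE00", "a\uD800b", "\uDC00"
    };
    for (String s : samples) {
      int encoded = s.getBytes(StandardCharsets.UTF_8).length;
      System.out.println(encoded + " " + length(s));
    }
  }
}
```

Each line printed by main should show the same number twice. If you would rather reject malformed input than mirror getBytes, throw an exception in the unpaired-surrogate branch instead.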

Here's a JUnit 4 test for verification:

import java.nio.charset.Charset;
import org.junit.Assert;
import org.junit.Test;

public class LenCounterTest {
  @Test
  public void testUtf8Len() {
    Charset utf8 = Charset.forName("UTF-8");
    AllCodepointsIterator iterator = new AllCodepointsIterator();
    while (iterator.hasNext()) {
      String test = new String(Character.toChars(iterator.next()));
      Assert.assertEquals(test.getBytes(utf8).length,
                          Utf8LenCounter.length(test));
    }
  }

  // Walks every defined code point, skipping the surrogate range.
  private static class AllCodepointsIterator {
    private static final int MAX = 0x10FFFF; //see http://unicode.org/glossary/
    private static final int SURROGATE_FIRST = 0xD800;
    private static final int SURROGATE_LAST = 0xDFFF;
    private int codepoint = 0;

    public boolean hasNext() { return codepoint < MAX; }

    public int next() {
      int ret = codepoint;
      codepoint = next(codepoint);
      return ret;
    }

    private int next(int codepoint) {
      while (codepoint++ < MAX) {
        if (codepoint == SURROGATE_FIRST) { codepoint = SURROGATE_LAST + 1; }
        if (!Character.isDefined(codepoint)) { continue; }
        return codepoint;
      }
      return MAX;
    }
  }
}

Please excuse the compact formatting.

McDowell answered Sep 18 '22 15:09


Using Guava's Utf8:

Utf8.encodedLength("some really long string") 

Aaron Feldman answered Sep 19 '22 15:09