Does anyone know if the standard Java library (any version) provides a means of calculating the length of the binary encoding of a string (specifically UTF-8 in this case) without actually generating the encoded output? In other words, I'm looking for an efficient equivalent of this:
"some really long string".getBytes("UTF-8").length
I need to calculate a length prefix for potentially long serialized messages.
The length() method: String.length() is a public method of the java.lang.String class, but it returns the number of UTF-16 code units (char values) in the string, not the number of bytes in any particular encoding. For pure ASCII text the two numbers coincide; for anything else, length() on its own does not give the UTF-8 size.
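To make the difference concrete, here is a small sketch comparing length() with the encoded size. The class name LengthVsBytes and the helper utf8Bytes are made up for illustration, and StandardCharsets assumes Java 7 or later. The helper does generate the encoded output, so it is only a reference measurement, not the efficient approach the question asks for.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: contrasts String.length() with the UTF-8 byte count.
class LengthVsBytes {

    // Reference measurement: encodes the whole string and counts the bytes.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the source file encoding does not matter.
        String[] samples = {
            "abc",            // ASCII only
            "\u00E9",         // U+00E9, e with acute accent
            "\u20AC",         // U+20AC, euro sign
            "\uD83D\uDE00"    // U+1F600, stored as a surrogate pair
        };
        for (String s : samples) {
            System.out.println(s.length() + " chars, " + utf8Bytes(s) + " bytes");
        }
        // prints:
        // 3 chars, 3 bytes
        // 1 chars, 2 bytes
        // 1 chars, 3 bytes
        // 2 chars, 4 bytes
    }
}
```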
UTF-8 is a variable-width encoding based on 8-bit code units: each Unicode code point is encoded as 1 to 4 bytes. The first 128 code points (U+0000 to U+007F) are encoded as single bytes identical to their ASCII representation.
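The byte count depends only on which range a code point falls in. A minimal sketch of that mapping follows; the class name Utf8Widths and the method bytesFor are hypothetical, and the method assumes it is given a valid Unicode scalar value rather than a surrogate.

```java
// Hypothetical helper: UTF-8 width of a single code point, by range.
class Utf8Widths {

    // Number of bytes UTF-8 uses for one code point (assumed to be a valid scalar value).
    static int bytesFor(int codePoint) {
        if (codePoint < 0x80) {
            return 1;   // U+0000..U+007F, same as ASCII
        }
        if (codePoint < 0x800) {
            return 2;   // U+0080..U+07FF
        }
        if (codePoint < 0x10000) {
            return 3;   // U+0800..U+FFFF
        }
        return 4;       // U+10000..U+10FFFF
    }

    public static void main(String[] args) {
        System.out.println(bytesFor('A'));      // prints 1
        System.out.println(bytesFor(0xE9));     // prints 2
        System.out.println(bytesFor(0x20AC));   // prints 3
        System.out.println(bytesFor(0x1F600));  // prints 4
    }
}
```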
Here's an implementation based on the UTF-8 specification:
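public class Utf8LenCounter {
    public static int length(CharSequence sequence) {
        int count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                // U+0000..U+007F: one byte
                count++;
            } else if (ch <= 0x7FF) {
                // U+0080..U+07FF: two bytes
                count += 2;
            } else if (Character.isHighSurrogate(ch)) {
                // Surrogate pair (U+10000 and above): four bytes; skip the low surrogate
                count += 4;
                ++i;
            } else {
                // Rest of the Basic Multilingual Plane: three bytes
                count += 3;
            }
        }
        return count;
    }
}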
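This implementation is not tolerant of malformed strings: it assumes every high surrogate is followed by a low surrogate, so for a string containing an unpaired surrogate the count will not match what getBytes produces.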
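If malformed input is a possibility, one option is to count an unpaired surrogate as a single byte, which matches what String.getBytes does for UTF-8 (it substitutes '?' for input it cannot encode). The following is only a sketch of that idea, not part of the implementation above; the class name LenientUtf8Len is made up, and Character.isSurrogate and StandardCharsets assume Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical variant: tolerates unpaired surrogates by counting them as one byte.
class LenientUtf8Len {

    static int length(CharSequence sequence) {
        int count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                count++;
            } else if (ch <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(ch)
                    && i + 1 < len
                    && Character.isLowSurrogate(sequence.charAt(i + 1))) {
                // Well-formed surrogate pair: four bytes, consume both chars.
                count += 4;
                i++;
            } else if (Character.isSurrogate(ch)) {
                // Unpaired surrogate: getBytes would emit a single '?' here.
                count++;
            } else {
                count += 3;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String malformed = "a\uD800b"; // lone high surrogate between two ASCII letters
        System.out.println(length(malformed));                                    // prints 3
        System.out.println(malformed.getBytes(StandardCharsets.UTF_8).length);    // prints 3
    }
}
```

Whether silently matching getBytes is the right behavior depends on the protocol; rejecting malformed input with an exception may be the safer choice for a wire format.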
Here's a JUnit 4 test for verification:
import java.nio.charset.Charset;

import org.junit.Assert;
import org.junit.Test;

public class LenCounterTest {
    @Test
    public void testUtf8Len() {
        Charset utf8 = Charset.forName("UTF-8");
        AllCodepointsIterator iterator = new AllCodepointsIterator();
        while (iterator.hasNext()) {
            // Compare the counter against the real encoder for each defined code point
            String test = new String(Character.toChars(iterator.next()));
            Assert.assertEquals(test.getBytes(utf8).length, Utf8LenCounter.length(test));
        }
    }

    // Iterates over defined code points, skipping the surrogate range
    private static class AllCodepointsIterator {
        private static final int MAX = 0x10FFFF; // see http://unicode.org/glossary/
        private static final int SURROGATE_FIRST = 0xD800;
        private static final int SURROGATE_LAST = 0xDFFF;
        private int codepoint = 0;

        public boolean hasNext() {
            return codepoint < MAX;
        }

        public int next() {
            int ret = codepoint;
            codepoint = next(codepoint);
            return ret;
        }

        private int next(int codepoint) {
            while (codepoint++ < MAX) {
                if (codepoint == SURROGATE_FIRST) {
                    codepoint = SURROGATE_LAST + 1;
                }
                if (!Character.isDefined(codepoint)) {
                    continue;
                }
                return codepoint;
            }
            return MAX;
        }
    }
}
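Using Guava's Utf8 class (com.google.common.base.Utf8, available since Guava 16.0), which throws IllegalArgumentException if the input contains unpaired surrogates: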
Utf8.encodedLength("some really long string")
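For the length-prefix use case from the question, the computed length can be written first and the string then encoded straight onto the stream, without building an intermediate byte array for the message. This is a rough sketch under several assumptions: the class LengthPrefixedWriter, its methods, and the 4-byte big-endian prefix are all invented for illustration, the input is assumed to be well-formed UTF-16, and CharSequence.codePoints requires Java 8 or later. The frame helper buffers in memory only so the result is easy to inspect.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical example: writes a 4-byte big-endian length prefix followed by the UTF-8 payload.
class LengthPrefixedWriter {

    // Counts UTF-8 bytes per code point without encoding (assumes well-formed input).
    static int utf8Length(String message) {
        return message.codePoints()
                .map(cp -> cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4)
                .sum();
    }

    // Writes the prefix, then lets the writer encode the message onto the same stream.
    static void writeTo(OutputStream out, String message) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(utf8Length(message));
        Writer writer = new OutputStreamWriter(data, StandardCharsets.UTF_8);
        writer.write(message);
        writer.flush();
    }

    // Convenience wrapper for demonstration: returns the framed bytes.
    static byte[] frame(String message) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeTo(buffer, message);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] framed = frame("h\u00E9llo"); // payload is 6 bytes in UTF-8
        System.out.println(framed.length);   // prints 10 (4-byte prefix + 6-byte payload)
    }
}
```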