How do I truncate a Java String
so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?
Using String's split() Method. Another way to truncate a String is to use the split() method, which uses a regular expression to split the String into pieces. The first element of results will either be our truncated String, or the original String if length was longer than text.
String Truncation. If an attempt is made to insert a string value into a table column that is too short to contain the value, the string is truncated.
Java strings can hold at most 2147483647 (2^31 - 1) characters, and the practical limit depends on your JVM. So if your string is bigger than that, you will have to break up the string.
Notice that spaces count in the length, but the double quotes do not. If we have escape sequences in the alphabet, then they count as one character.
Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when it is exceeded:
public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0; // UTF-8 bytes counted so far
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        } else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}
This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() will return the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.
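As a quick check of the byte counts the loop above relies on, here is a minimal sketch that asks Java's own encoder for the UTF-8 length of one character from each range. The class name Utf8ByteCounts and the helper utf8Length are made up for this illustration; they are not part of the answer's code.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteCounts {
    // Hypothetical helper: number of bytes the string occupies once UTF-8 encoded.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // U+0061, prints 1
        System.out.println(utf8Length("\u0080"));       // U+0080, prints 2
        System.out.println(utf8Length("\u0800"));       // U+0800, prints 3
        // U+1D11E is stored as two Java chars (a surrogate pair)
        // but encodes to one 4-byte sequence, not 3 + 3 bytes.
        System.out.println(utf8Length("\uD834\uDD1E")); // prints 4
    }
}
```

Note that "\uD834\uDD1E".length() is 2, so counting chars and counting bytes diverge in both directions.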
I haven't done a lot of testing on that code, but here are some preliminary tests:
import java.nio.charset.Charset;

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    // 1-byte characters only
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    // a 2-byte character in the middle
    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    // a 3-byte character in the middle
    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);
}
Update: modified the code example; it now handles surrogate pairs.
You should use CharsetEncoder; the simple approach of calling getBytes() and copying as many bytes as will fit can cut UTF-8 characters in half.
Something like this:
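import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public static int truncateUtf8(String input, byte[] output) {
    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

    CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder();
    // Stops when the output is full; never writes part of a character.
    utf8Enc.encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length()
            + ", result: " + outBuf.position() + " bytes");
    // Number of bytes actually written into output.
    return outBuf.position();
}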
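If you want the truncated String back rather than a byte count, the same encoder can tell you how many chars it consumed. The following is only a sketch along those lines: the class name Utf8Truncate and the method truncateToUtf8Bytes are invented for this example, and setting the error actions to REPLACE is an assumption made so that unpaired surrogates are handled like String.getBytes() handles them (replaced with '?') instead of stopping the encoder early.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Truncate {
    // Hypothetical helper: longest prefix of input whose UTF-8 form fits in maxBytes.
    public static String truncateToUtf8Bytes(String input, int maxBytes) {
        ByteBuffer outBuf = ByteBuffer.allocate(maxBytes);
        CharBuffer inBuf = CharBuffer.wrap(input);

        CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder()
                // Assumption: replace unpaired surrogates rather than report an error.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // Encoding stops with an overflow result once the next character
        // no longer fits; inBuf.position() then counts the chars consumed.
        utf8Enc.encode(inBuf, outBuf, true);
        return input.substring(0, inBuf.position());
    }

    public static void main(String[] args) {
        System.out.println(truncateToUtf8Bytes("abcd", 2));             // prints ab
        System.out.println(truncateToUtf8Bytes("a\u0800b", 3));         // prints a
        System.out.println(truncateToUtf8Bytes("\uD834\uDD1E", 3).length()); // prints 0
    }
}
```

This allocates a scratch buffer of maxBytes on every call, which is fine for occasional use; for hot paths the counting loop shown earlier avoids the allocation.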