
How do I truncate a Java string to fit in a given number of bytes, once UTF-8 encoded?

How do I truncate a Java String so that it is guaranteed to fit in a given number of bytes of storage once it is UTF-8 encoded?

Johan Lübcke asked Sep 23 '08 06:09



2 Answers

Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates as soon as the byte limit would be exceeded:

    public static String truncateWhenUTF8(String s, int maxBytes) {
        int b = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);

            // ranges from http://en.wikipedia.org/wiki/UTF-8
            int skip = 0;
            int more;
            if (c <= 0x007f) {
                more = 1;
            } else if (c <= 0x07FF) {
                more = 2;
            } else if (c <= 0xd7ff) {
                more = 3;
            } else if (c <= 0xDFFF) {
                // surrogate area, consume next char as well
                more = 4;
                skip = 1;
            } else {
                more = 3;
            }

            if (b + more > maxBytes) {
                return s.substring(0, i);
            }
            b += more;
            i += skip;
        }
        return s;
    }

This handles surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() returns the longest truncated string that fits. If the implementation ignored surrogate pairs, the truncated strings could be shorter than they need to be.

I haven't done a lot of testing on that code, but here are some preliminary tests:

    // requires: import java.nio.charset.Charset;

    private static void test(String s, int maxBytes, int expectedBytes) {
        String result = truncateWhenUTF8(s, maxBytes);
        byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
        if (utf8.length > maxBytes) {
            System.out.println("BAD: our truncation of " + s + " was too big");
        }
        if (utf8.length != expectedBytes) {
            System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
        }
        System.out.println(s + " truncated to " + result);
    }

    public static void main(String[] args) {
        // 1-byte characters only
        test("abcd", 0, 0);
        test("abcd", 1, 1);
        test("abcd", 2, 2);
        test("abcd", 3, 3);
        test("abcd", 4, 4);
        test("abcd", 5, 4);

        // U+0080 takes 2 bytes
        test("a\u0080b", 0, 0);
        test("a\u0080b", 1, 1);
        test("a\u0080b", 2, 1);
        test("a\u0080b", 3, 3);
        test("a\u0080b", 4, 4);
        test("a\u0080b", 5, 4);

        // U+0800 takes 3 bytes
        test("a\u0800b", 0, 0);
        test("a\u0800b", 1, 1);
        test("a\u0800b", 2, 1);
        test("a\u0800b", 3, 1);
        test("a\u0800b", 4, 4);
        test("a\u0800b", 5, 5);
        test("a\u0800b", 6, 5);

        // surrogate pairs
        test("\uD834\uDD1E", 0, 0);
        test("\uD834\uDD1E", 1, 0);
        test("\uD834\uDD1E", 2, 0);
        test("\uD834\uDD1E", 3, 0);
        test("\uD834\uDD1E", 4, 4);
        test("\uD834\uDD1E", 5, 4);
    }

Update: modified the code example; it now handles surrogate pairs.
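
The surrogate-pair point can be checked directly. The following is only a minimal sketch: the class name SurrogatePairBytes and the helper utf8Length are made up for illustration, and the helper assumes a well-formed code point (it does not model how the encoder treats a lone surrogate). It shows that one supplementary character occupies two Java chars but encodes to a single 4-byte UTF-8 sequence.

```java
import java.nio.charset.StandardCharsets;

class SurrogatePairBytes {
    // Hypothetical helper: UTF-8 byte count for one well-formed Unicode code point.
    static int utf8Length(int codePoint) {
        if (codePoint <= 0x7F) return 1;    // ASCII
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;  // rest of the Basic Multilingual Plane
        return 4;                           // supplementary planes (surrogate pair in Java)
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E: one code point, two Java chars

        System.out.println("chars: " + clef.length());
        System.out.println("bytes: " + clef.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("computed: " + utf8Length(clef.codePointAt(0)));
    }
}
```

Counting the two chars separately as 3 bytes each would budget 6 bytes for a character that needs only 4, which is why a loop that ignores surrogate pairs can truncate earlier than necessary.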

Matt Quail answered Oct 05 '22 03:10


You should use CharsetEncoder. The simple approach of calling getBytes() and copying as many bytes as fit can cut UTF-8 characters in half.

Something like this:

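    // requires:
    // import java.nio.ByteBuffer;
    // import java.nio.CharBuffer;
    // import java.nio.charset.CharsetEncoder;
    // import java.nio.charset.StandardCharsets;

    // Encodes as much of input as fits into output and returns the number of bytes written.
    public static int truncateUtf8(String input, byte[] output) {

        ByteBuffer outBuf = ByteBuffer.wrap(output);
        CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

        CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder();
        // Stops before any character whose full encoding does not fit in outBuf.
        utf8Enc.encode(inBuf, outBuf, true);
        System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
        return outBuf.position();
    }
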
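
If a truncated String is wanted rather than a byte count, the same encoder approach can be adapted. This is a sketch under assumptions, not part of the answer: the class Utf8Truncator and the method truncateToUtf8Bytes are hypothetical names, maxBytes is assumed to be non-negative, and replacing malformed input (such as a lone surrogate) instead of reporting it is a choice made here for simplicity.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Truncator {
    // Hypothetical helper: longest prefix of input whose UTF-8 encoding fits in maxBytes.
    static String truncateToUtf8Bytes(String input, int maxBytes) {
        ByteBuffer outBuf = ByteBuffer.allocate(maxBytes); // throws if maxBytes < 0
        CharBuffer inBuf = CharBuffer.wrap(input);

        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                // Assumption: substitute a replacement for malformed input rather than failing.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // The encoder stops before any character that does not fit completely,
        // so inBuf.position() is the number of chars that were fully encoded.
        encoder.encode(inBuf, outBuf, true);
        return input.substring(0, inBuf.position());
    }

    public static void main(String[] args) {
        System.out.println(truncateToUtf8Bytes("abcd", 2));
        System.out.println(truncateToUtf8Bytes("a\u0800b", 4).length());
        System.out.println(truncateToUtf8Bytes("\uD834\uDD1E", 3).isEmpty());
    }
}
```

Using the input buffer's position avoids decoding the bytes back into a String, and a surrogate pair is either kept whole or dropped whole.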

mitchnull answered Oct 05 '22 04:10