How do I truncate a Java string to fit in a given number of bytes, once UTF-8 encoded?

2 Answers

Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when it is exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        } else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

This handles surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() returns the longest truncated string that fits. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.
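To make the 4-byte claim concrete, here is a minimal sketch that compares a surrogate pair's length in Java chars with its length in UTF-8 bytes. The class name Utf8LengthDemo and the helper utf8Length are hypothetical names introduced only for this illustration, and the helper assumes it is given a valid code point rather than a lone surrogate.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Hypothetical helper: number of UTF-8 bytes needed for one valid Unicode code point.
    static int utf8Length(int codePoint) {
        if (codePoint <= 0x7F) return 1;
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;
        return 4; // supplementary code points, stored in Java as a surrogate pair
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef): one code point, two Java chars.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                                // 2
        System.out.println(clef.getBytes(StandardCharsets.UTF_8).length); // 4
        System.out.println(utf8Length(clef.codePointAt(0)));              // 4
    }
}
```

Counting each half of the pair as a separate 3-byte character would estimate 6 bytes instead of 4, which is why a surrogate-unaware loop can cut the string earlier than necessary.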

I haven't done a lot of testing on that code, but here are some preliminary tests:

import java.nio.charset.Charset;

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);
}

Update: modified the code example; it now handles surrogate pairs.

131

answered Oct 05 '22 03:10

Matt Quail

You should use CharsetEncoder; the simple approach of calling getBytes() and copying as many bytes as fit can cut UTF-8 characters in half.

Something like this:

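import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public static int truncateUtf8(String input, byte[] output) {
    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

    // The encoder stops at a character boundary when the output buffer is full.
    CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder();
    utf8Enc.encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}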
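If you need the truncated String rather than the encoded bytes, the same idea can be adapted as in the sketch below. The class name Utf8Truncate and the method name truncateToUtf8Bytes are hypothetical, not part of the answer above. The sketch assumes maxBytes is non-negative and the input contains no unpaired surrogates; with the encoder's default settings, encoding stops at a malformed character, so the result would be cut there.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Utf8Truncate {
    // Hypothetical helper: longest prefix of input whose UTF-8 encoding fits in maxBytes.
    static String truncateToUtf8Bytes(String input, int maxBytes) {
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // throws if maxBytes < 0
        CharBuffer in = CharBuffer.wrap(input);
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        // Encoding stops when the next character no longer fits in the output
        // buffer; in.position() is then the count of chars fully encoded.
        encoder.encode(in, out, true);
        return input.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(truncateToUtf8Bytes("abcd", 2));                  // ab
        System.out.println(truncateToUtf8Bytes("a\u0800b", 3).length());     // 1
        System.out.println(truncateToUtf8Bytes("\uD834\uDD1E", 3).length()); // 0
    }
}
```

This variant allocates a scratch buffer of maxBytes on every call, which is simple but may be wasteful for large limits; the hand-written counting loop avoids the allocation at the cost of maintaining the byte ranges yourself.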

answered Oct 05 '22 04:10

mitchnull

Tags:

java

string

unicode

utf-8

truncate

Johan Lübcke
