
Truncating Strings by Bytes

I created the following to truncate a string in Java to a new string that takes up at most a given number of bytes.

        String truncatedValue = "";
        String currentValue = string;
        int pivotIndex = (int) Math.round(((double) string.length())/2);
        while(!truncatedValue.equals(currentValue)){
            currentValue = string.substring(0,pivotIndex);
            byte[] bytes = null;
            bytes = currentValue.getBytes(encoding);
            if(bytes==null){
                return string;
            }
            int byteLength = bytes.length;
            int newIndex =  (int) Math.round(((double) pivotIndex)/2);
            if(byteLength > maxBytesLength){
                pivotIndex = newIndex;
            } else if(byteLength < maxBytesLength){
                pivotIndex = pivotIndex + 1;
            } else {
                truncatedValue = currentValue;
            }
        }
        return truncatedValue;

This is the first thing that came to my mind, and I know I could improve on it. I saw another post asking a similar question, but they were truncating strings using the bytes instead of String.substring. I think I would rather use String.substring in my case.

EDIT: I just removed the UTF-8 reference because I would rather be able to do this for different storage types as well.
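
For comparison, here is a minimal sketch of a substring-based approach that takes the charset as a parameter. It assumes the encoded length of a prefix never shrinks as the prefix grows, which holds for common charsets such as UTF-8 and ISO-8859-1. The class and method names are hypothetical and are not taken from the code above.

```java
import java.nio.charset.Charset;

// Hypothetical helper: binary search over the prefix length, using String.substring.
final class SubstringTruncator {

    /** Returns the longest prefix of s whose encoding in charset is at most maxBytes long. */
    static String truncate(String s, int maxBytes, Charset charset) {
        int lo = 0;           // a prefix of this length is known to fit
        int hi = s.length();  // no longer prefix than this needs to be considered
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1; // round up so the loop always makes progress
            if (s.substring(0, mid).getBytes(charset).length <= maxBytes) {
                lo = mid;     // the prefix of length mid fits, try a longer one
            } else {
                hi = mid - 1; // too long, try a shorter one
            }
        }
        // Do not end between the two halves of a surrogate pair.
        if (lo > 0 && lo < s.length()
                && Character.isHighSurrogate(s.charAt(lo - 1))
                && Character.isLowSurrogate(s.charAt(lo))) {
            lo--;
        }
        return s.substring(0, lo);
    }
}
```

Each probe re-encodes a prefix, so the cost is roughly O(n log n) in the string length. That is probably acceptable for short values but wasteful for large ones.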

stevebot asked Aug 26 '10 15:08


4 Answers

A saner solution is to use a CharsetDecoder:

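// Needs java.nio.ByteBuffer, java.nio.CharBuffer and java.nio.charset.{Charset, CharsetDecoder, CodingErrorAction}.
// decode() throws the checked CharacterCodingException; limit is the maximum number of bytes.
final Charset CHARSET = Charset.forName("UTF-8"); // or any other charset
final byte[] bytes = inputString.getBytes(CHARSET);
final CharsetDecoder decoder = CHARSET.newDecoder();
// Silently drop a multi-byte sequence that the byte limit cuts in half.
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.reset();
// Clamp the limit so that an input shorter than limit does not make wrap() throw.
final CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes, 0, Math.min(limit, bytes.length)));
final String outputString = decoded.toString();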
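
The decoder approach above can be wrapped into a self-contained method. This is only a sketch: the class and method names are hypothetical, and it assumes the input is well-formed, meaning it contains no unpaired surrogates.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical wrapper around the CharsetDecoder technique.
final class DecoderTruncator {

    /** Encodes input, keeps at most maxBytes bytes, and decodes them, dropping a cut-off trailing sequence. */
    static String truncate(String input, int maxBytes, Charset charset) {
        byte[] bytes = input.getBytes(charset);
        if (bytes.length <= maxBytes) {
            return input; // already fits, nothing to cut
        }
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE); // ignore the incomplete tail
        try {
            CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes, 0, Math.max(0, maxBytes)));
            return decoded.toString();
        } catch (CharacterCodingException e) {
            // Not expected for bytes that getBytes() just produced.
            throw new IllegalStateException(e);
        }
    }
}
```

With UTF-8, `DecoderTruncator.truncate("héllo", 2, StandardCharsets.UTF_8)` keeps only the "h", because the second byte is the first half of the two-byte "é".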

kan answered Nov 16 '22 02:11


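String s = "FOOBAR";

int limit = 3;
// Only safe when every character encodes to a single byte (e.g. plain ASCII).
// getBytes() without arguments uses the platform default charset.
byte[] bytes = s.getBytes();
s = new String(bytes, 0, Math.min(limit, bytes.length));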

Result value of s:

FOO
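
The example above works because every character of "FOOBAR" encodes to one byte. The sketch below shows what the same cut does when the limit lands inside a multi-byte character. It uses UTF-8 explicitly, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of cutting the encoded bytes without any boundary check.
final class NaiveCutDemo {

    /** Cuts the UTF-8 bytes of s at limit and decodes whatever is left. */
    static String naiveCut(String s, int limit) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        // The String constructor replaces a malformed (here: incomplete) sequence with U+FFFD.
        return new String(bytes, 0, Math.min(limit, bytes.length), StandardCharsets.UTF_8);
    }
}
```

`naiveCut("FOOBAR", 3)` still returns "FOO", but `naiveCut("été", 1)` returns the replacement character U+FFFD, because the single remaining byte is only the first half of "é".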

Ilya Lysenko answered Nov 16 '22 03:11


Why not convert to bytes and walk forward--obeying UTF8 character boundaries as you do it--until you've got the max number, then convert those bytes back into a string?
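
A minimal sketch of that first idea, assuming UTF-8 and the valid byte sequence that getBytes() produces. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: walk forward one UTF-8 sequence at a time, then decode the part that fits.
final class Utf8ForwardCutter {

    /** Returns the longest prefix of s whose UTF-8 encoding is at most maxBytes long. */
    static String cut(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        int end = 0; // end of the last complete sequence that still fits
        while (end < utf8.length) {
            int lead = utf8[end] & 0xFF;
            int len;
            if (lead < 0x80) len = 1;      // 0xxxxxxx
            else if (lead < 0xE0) len = 2; // 110xxxxx
            else if (lead < 0xF0) len = 3; // 1110xxxx
            else len = 4;                  // 11110xxx
            if (end + len > maxBytes) {
                break;                     // the next character would exceed the limit
            }
            end += len;
        }
        return new String(utf8, 0, end, StandardCharsets.UTF_8);
    }
}
```

Because whole sequences are kept or dropped, the decoded result never contains half a character.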

Or you could just cut the original string if you keep track of where the cut should occur:

// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
public class UTF8Cutter {
  public static String cut(String s, int n) {
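    // Encode explicitly as UTF-8; the no-argument getBytes() uses the platform default charset.
    byte[] utf8 = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    if (utf8.length < n) n = utf8.length;
    int n16 = 0;      // number of UTF-16 chars to keep
    int advance = 1;  // UTF-16 chars produced by the current character
    int i = 0;        // byte index of the next UTF-8 sequence
    while (i < n) {
      advance = 1;
      if ((utf8[i] & 0x80) == 0) i += 1;          // 0xxxxxxx: 1 byte
      else if ((utf8[i] & 0xE0) == 0xC0) i += 2;  // 110xxxxx: 2 bytes
      else if ((utf8[i] & 0xF0) == 0xE0) i += 3;  // 1110xxxx: 3 bytes
      else { i += 4; advance = 2; }               // 11110xxx: 4 bytes, a surrogate pair in UTF-16
      if (i <= n) n16 += advance;                 // count the character only if it fits entirely
    }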
    }
    return s.substring(0,n16);
  }
}

Note: edited to fix bugs on 2014-08-25

Rex Kerr answered Nov 16 '22 04:11


I think Rex Kerr's solution has 2 bugs.

  • First, it will truncate to limit+1 bytes if a non-ASCII character sits just before the limit. With a limit of 10, truncating "123456789á1" results in "123456789á", which takes 11 bytes in UTF-8.
  • Second, I think he misinterpreted the UTF-8 standard. https://en.wikipedia.org/wiki/UTF-8#Description shows that a 110xxxxx byte at the beginning of a UTF-8 sequence means the representation is 2 bytes long (as opposed to 3). That's why his implementation usually doesn't use up all the available space (as Nissim Avitan noted).
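
The byte counts in the two points above can be checked with a small sketch. The helper names are hypothetical, and the bit patterns follow the UTF-8 description linked above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helpers for inspecting UTF-8 lead bytes and encoded lengths.
final class Utf8LeadByte {

    /** Length in bytes of the sequence that starts with lead, or -1 if lead cannot start a sequence. */
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 10xxxxxx (continuation byte) or not valid in UTF-8
    }

    /** Number of bytes s occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

The first byte of "á" in UTF-8 is 0xC3 (binary 11000011), so `sequenceLength((byte) 0xC3)` is 2, and `utf8Length("123456789á")` is 11.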

Please find my corrected version below:

public String cut(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return s;
    }
    int n16 = 0;
    boolean extraLong = false;
    int i = 0;
    while (i < charLimit) {
        // Unicode characters above U+FFFF need 2 words in utf16
        extraLong = ((utf8[i] & 0xF0) == 0xF0);
        if ((utf8[i] & 0x80) == 0) {
            i += 1;
        } else {
            int b = utf8[i];
            while ((b & 0x80) > 0) {
                ++i;
                b = b << 1;
            }
        }
        if (i <= charLimit) {
            n16 += (extraLong) ? 2 : 1;
        }
    }
    return s.substring(0, n16);
}

I still thought this was far from efficient. So if you don't really need the String representation of the result and the byte array will do, you can use this:

private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return utf8;
    }
    if ((utf8[charLimit] & 0xC0) != 0x80) {
        // the byte at the limit is not a continuation byte (10xxxxxx), so the limit doesn't cut a UTF-8 sequence
        return Arrays.copyOf(utf8, charLimit);
    }
    int i = 0;
    while ((utf8[charLimit-i-1] & 0x80) > 0 && (utf8[charLimit-i-1] & 0x40) == 0) {
        ++i;
    }
    if ((utf8[charLimit-i-1] & 0x80) > 0) {
        // we have to skip the starter UTF-8 byte
        return Arrays.copyOf(utf8, charLimit-i-1);
    } else {
        // we passed all UTF-8 bytes
        return Arrays.copyOf(utf8, charLimit-i);
    }
}

Funny thing is that with a realistic 20-500 byte limit they perform pretty much the same IF you create a string from the byte array again.

Please note that both methods assume valid UTF-8 input, which is a safe assumption after calling Java's getBytes("UTF-8").
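
As a possible simplification of the backward scan, the cut position can be found with a single loop that backs up while the byte at the cut is a continuation byte. The sketch below uses hypothetical names, makes the same valid-UTF-8 assumption, and also shows the conversion back to a String.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical variant: scan backwards from the limit to the nearest character boundary.
final class Utf8BackwardCutter {

    /** Returns at most byteLimit bytes of the UTF-8 encoding of s, ending on a character boundary. */
    static byte[] cutToBytes(String s, int byteLimit) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= byteLimit) {
            return utf8;
        }
        int end = Math.max(0, byteLimit);
        // utf8[end] is the first byte that gets dropped. If it is a continuation
        // byte (10xxxxxx), the cut is inside a sequence, so move it back.
        while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
            end--;
        }
        return Arrays.copyOf(utf8, end);
    }

    /** Same cut, decoded back into a String. */
    static String cut(String s, int byteLimit) {
        return new String(cutToBytes(s, byteLimit), StandardCharsets.UTF_8);
    }
}
```

The loop runs at most three times, since a UTF-8 sequence has at most three continuation bytes.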

Zsolt Taskai answered Nov 16 '22 02:11