
Creating a String from a byte array does not return the same length

I have this problem: I receive a String in a method, and in the database it must be limited to 200 (VARCHAR). With certain characters, although the length of the String is less than 200, the byte length is apparently more than 200, so I tried the following.

Get the byte length of the String:

byte[] nameBytes = name.getBytes("UTF-8");

Then, if nameBytes.length > 200, I try to create a new String from a subarray of the original nameBytes, like this:

name = new String(Arrays.copyOfRange(nameBytes, 0, 200), "UTF-8");

I am sure that Arrays.copyOfRange(nameBytes, 0, 200) returns an array of length 200, but for some reason, once I create the new String, checking name.getBytes("UTF-8").length gives me 201, so I don't know why one more byte is being added.

Is there something I am doing wrong? Or is there a way to be sure of creating an array of the same length as the char array?

Thanks in advance.

John B asked May 21 '26 19:05



1 Answer

First, some examples:



        // Requires: import java.nio.charset.Charset;
        String name = "façade";
        System.out.println(String.format("String '%s': %d", name, name.length()));

        // For each charset, print: encoded byte count / char count after decoding back
        for (String cs : new String[] {"UTF-8", "UTF-16", "UTF-16BE"}) {
            Charset charset = Charset.forName(cs);
            byte[] nameBytes = name.getBytes(charset);
            System.out.println(String.format("%s: %d / %d", cs, nameBytes.length, new String(nameBytes, charset).length()));
        }

with the output:



    String 'façade': 6  ---> 6 characters, one of them outside the ASCII range
    UTF-8: 7 / 6 ---> 'ç' requires 2 bytes, the others only one
    UTF-16: 14 / 6 ---> 2 x 6 bytes for the characters + 2 bytes for the BOM
    UTF-16BE: 12 / 6 ---> no need to embed a BOM here => 2 x 6 bytes are enough

Comments:

  • always specify a charset, in both directions (when encoding with getBytes and when decoding with new String)
  • about the BOM, see "Byte order mark"
  • as the "Unicode Character Representations" section of the Character documentation says: "The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities."

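The issue here is the charset used by your database. If it's UTF-8, then you have to check character by character where you hit the 200-byte limit. With UTF-8, you can't cut the string at an arbitrary byte offset: the cut can fall in the middle of a multi-byte character. When the truncated array is decoded, the incomplete sequence is replaced with the replacement character U+FFFD, which itself takes 3 bytes in UTF-8, so the new String can encode to more bytes than the array you passed in.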
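
The character-by-character check could look roughly like the sketch below. It assumes the 200 limit is counted in UTF-8 bytes; the class name Utf8Truncate and the method truncateUtf8 are hypothetical names chosen for illustration, not part of any library. It walks the string one code point at a time and stops before the first code point that would not fit.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class Utf8Truncate {

    // Returns a prefix of s whose UTF-8 encoding is at most maxBytes long.
    // The cut is always made between code points, never inside a
    // multi-byte sequence or a surrogate pair.
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // Number of bytes UTF-8 needs for this code point
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break;
            }
            bytes += size;
            i += Character.charCount(cp); // 2 chars for a surrogate pair
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        // "facade" spelled with a c-cedilla, which takes 2 bytes in UTF-8
        String name = "fa\u00e7ade";
        byte[] raw = name.getBytes(StandardCharsets.UTF_8);

        // Naive cut: 3 bytes ends in the middle of the c-cedilla
        String naive = new String(Arrays.copyOfRange(raw, 0, 3), StandardCharsets.UTF_8);
        System.out.println(naive.getBytes(StandardCharsets.UTF_8).length); // 5, not 3

        // Safe cut: the c-cedilla does not fit, so it is dropped entirely
        String safe = truncateUtf8(name, 3);
        System.out.println(safe + " " + safe.getBytes(StandardCharsets.UTF_8).length); // fa 2
    }
}
```

In the question's case the call would be name = Utf8Truncate.truncateUtf8(name, 200). The result may encode to slightly fewer than 200 bytes, since a character that does not fit completely is left out rather than split.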

atao answered May 23 '26 08:05



