I have the following Java code:
byte[] signatureBytes = getSignature();
String signatureString = new String(signatureBytes, "UTF8");
byte[] signatureStringBytes = signatureString.getBytes("UTF8");
System.out.println(signatureBytes.length == signatureStringBytes.length); // prints false
Q: I'm probably misunderstanding this, but I thought that new String(byte[] bytes, String charset) and String.getBytes(charset) are inverse operations?
Q: As a follow-up, what is a safe way to transport a byte[] as a String?
Not every byte[] is valid UTF-8. By default, invalid sequences get replaced by a fixed replacement character (U+FFFD in Java's UTF-8 decoder), and I think that's the reason for the length change.
Try Latin-1 (ISO-8859-1). It should not happen there, as it's a simple encoding in which every byte[] is meaningful.
It should not happen with Windows-1252 either. There are undefined sequences there (in fact, undefined bytes), but every char is encoded as a single byte. The new byte[] may differ from the original one, but their lengths must be the same.
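As a minimal sketch of the difference described above, the following compares a round trip through UTF-8 with one through Latin-1. The class name RoundTripDemo, the helper roundTrip, and the sample bytes are made up for illustration; the sample bytes merely stand in for whatever getSignature() returns.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // Decode the bytes into a String with the given charset, then encode the String back.
    static byte[] roundTrip(byte[] input, Charset charset) {
        return new String(input, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample data: 0xFF and 0xFE never occur in valid UTF-8.
        byte[] original = {(byte) 0xFF, (byte) 0xFE, 0x41};

        byte[] viaUtf8 = roundTrip(original, StandardCharsets.UTF_8);
        byte[] viaLatin1 = roundTrip(original, StandardCharsets.ISO_8859_1);

        // Each invalid byte became U+FFFD, which encodes to three bytes in UTF-8.
        System.out.println(original.length + " -> " + viaUtf8.length); // prints 3 -> 7

        // Latin-1 maps every byte to exactly one char, so the round trip is lossless.
        System.out.println(Arrays.equals(original, viaLatin1)); // prints true
    }
}
```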
I'm probably misunderstanding this, but I thought that new String(byte[] bytes, String charset) and String.getBytes(charset) are inverse operations?
Not necessarily.
If the input byte array contains sequences that are not valid UTF-8, then the initial conversion may turn them into (for example) question marks. The second operation then turns these into UTF-8 encoded '?' characters, which differ from the original representation.
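If silent replacement is not wanted, one option is to decode strictly and treat a failure as "this is not text". The sketch below uses a CharsetDecoder configured to report errors; the class name StrictUtf8 and the method isValidUtf8 are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Returns true only if the bytes are well-formed UTF-8; nothing is replaced silently.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for malformed input instead of substituting a replacement character.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] text = "h\u00E9llo".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {(byte) 0xC3, (byte) 0x28}; // lead byte followed by a non-continuation byte

        System.out.println(isValidUtf8(text));   // prints true
        System.out.println(isValidUtf8(binary)); // prints false
    }
}
```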
It is true that some characters in Unicode have multiple representations; e.g. accented characters can be a single codepoint, or a base character codepoint and an accent codepoint. However, converting back and forth between a byte array (containing valid UTF-8) and String should preserve the codepoint sequences. It doesn't perform any "normalization".
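A small sketch of that point: both spellings of an accented "e" survive a UTF-8 round trip unchanged, and they remain distinct strings afterwards. The class name NoNormalizationDemo and the helper roundTrip are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class NoNormalizationDemo {
    // Encode the String as UTF-8 and decode it again.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // single codepoint: e with acute accent
        String decomposed = "e\u0301"; // base letter followed by a combining acute accent

        System.out.println(roundTrip(composed).equals(composed));     // prints true
        System.out.println(roundTrip(decomposed).equals(decomposed)); // prints true
        System.out.println(composed.equals(decomposed));              // prints false
    }
}
```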
So what would be a safe way to transport a byte[] array as String then?
The safest alternative would be to Base64-encode the byte array. This has the added advantage that the characters in the String will survive conversion into any character set / encoding that can represent Latin letters and digits.
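A minimal sketch of the Base64 approach, assuming Java 8 or later so that java.util.Base64 is available. The class name Base64Transport, the helpers toText and fromText, and the sample bytes are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64Transport {
    // Turn arbitrary bytes into an ASCII-only String.
    static String toText(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Recover the exact original bytes from the Base64 text.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Hypothetical sample data that is not valid UTF-8.
        byte[] original = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x41};

        String text = toText(original);
        byte[] restored = fromText(text);

        System.out.println(text);
        System.out.println(Arrays.equals(original, restored)); // prints true
    }
}
```

The encoded String is about a third longer than the raw bytes, which is the price of restricting it to a small ASCII alphabet.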
Another alternative is to use Latin-1 instead of UTF-8. However: