I read a byte[] from a file and convert it to a String:
byte[] bytesFromFile = Files.readAllBytes(...);
String stringFromFile = new String(bytesFromFile, "UTF-8");
I want to compare this to another byte[] I get from a web service:
String stringFromWebService = webService.getMyByteString();
byte[] bytesFromWebService = stringFromWebService.getBytes("UTF-8");
So I read a byte[] from a file and convert it to a String, and I get a String from my web service and convert it to a byte[]. Then I do the following tests:
// works!
org.junit.Assert.assertEquals(stringFromFile, stringFromWebService);
// fails!
org.junit.Assert.assertArrayEquals(bytesFromFile, bytesFromWebService);
Why does the second assertion fail?
Other answers have covered the likely cause: the file is not UTF-8 encoded, which gives rise to the symptoms described.
However, I think the most interesting aspect of this is not that the byte[] assert fails, but that the assert that the string values are the same passes. I'm not 100% sure why this is, but I think the following trawl through the source code might give us the answer:
Looking at how new String(bytesFromFile, "UTF-8"); works, we see that the constructor calls through to StringCoding.decode(), which, for the UTF-8 character set, calls through to StringDecoder.decode(). That in turn calls CharsetDecoder.decode(), which decides what to do if the character is unmappable (which I guess will be the case if a non-UTF-8 character is presented). By default, CharsetDecoder uses an action defined by:
private CodingErrorAction unmappableCharacterAction
= CodingErrorAction.REPORT;
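Note that REPORT means a coding error is reported (by returning a CoderResult or throwing a CharacterCodingException), and it is only the default for a newly created decoder. The decoder used on the String path is evidently configured to substitute instead, so an invalid sequence is replaced with the replacement character U+FFFD rather than raising an error.
I think this means that when the code hits an unmappable or malformed sequence, it substitutes a replacement character. Different invalid byte sequences can then decode to the same String, so the String representations compare equal while the byte[] values are no longer the same.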
This hypothesis is kind of supported by the fact that the catch block for CharacterCodingException in StringCoding.decode() says:
} catch (CharacterCodingException x) {
// Substitution is always enabled,
// so this shouldn't happen
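To make the substitution concrete, here is a minimal sketch. The class name ReplacementRoundTrip, the helper roundTrip, and the three sample bytes are made up for illustration and do not come from the question. It relies on String(byte[], Charset), which is documented to replace malformed input with the charset's replacement string (U+FFFD for UTF-8).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementRoundTrip {

    // Decodes the bytes as UTF-8, then encodes the resulting String back to UTF-8.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical samples: 0xE9 and 0xE8 are valid ISO-8859-1 letters,
        // but here each is an incomplete multi-byte sequence in UTF-8.
        byte[] first = {(byte) 'a', (byte) 0xE9, (byte) 'b'};
        byte[] second = {(byte) 'a', (byte) 0xE8, (byte) 'b'};

        String s1 = new String(first, StandardCharsets.UTF_8);
        String s2 = new String(second, StandardCharsets.UTF_8);

        // Both invalid bytes were replaced by U+FFFD, so the Strings are equal
        // even though the underlying byte arrays differ.
        System.out.println("strings equal: " + s1.equals(s2));
        System.out.println("bytes equal:   " + Arrays.equals(first, second));

        // The round trip does not give the original bytes back either,
        // because U+FFFD is encoded as the three bytes EF BF BD.
        System.out.println("round trip preserved bytes: "
                + Arrays.equals(first, roundTrip(first)));
    }
}
```

If the web service produced its String by a similarly lenient decode, this would be consistent with assertEquals passing while assertArrayEquals fails.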
I don't understand it fully, but here's what I have so far:
The problem is that the data contains some bytes which are not valid UTF-8, as I can tell from the following check:
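import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

// returns false for my data!
public static boolean isValidUTF8(byte[] input) {
    // A newly created decoder reports malformed input by throwing.
    CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();
    try {
        cs.decode(ByteBuffer.wrap(input));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}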
When I change the encoding to ISO-8859-1, everything works fine. The strange thing (which I don't understand yet) is why my conversion (new String(bytesFromFile, "UTF-8");) doesn't throw any exception (unlike my isValidUTF8 method), although the data is not valid UTF-8.
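One possible explanation, sketched below, is that the two code paths use different error actions: a decoder obtained from newDecoder() starts with CodingErrorAction.REPORT, while the String constructor substitutes U+FFFD. The class StrictVersusLenient, its decode helper, and the sample bytes are hypothetical names chosen for this illustration. The last line also shows why ISO-8859-1 round-trips cleanly: every byte value maps to exactly one character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StrictVersusLenient {

    // Decodes as UTF-8 with the given error action.
    // Returns null if the decoder reports a coding error.
    static String decode(byte[] input, CodingErrorAction action) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(input)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: 0xE9 followed by 'b' is not valid UTF-8.
        byte[] invalid = {(byte) 'a', (byte) 0xE9, (byte) 'b'};

        // REPORT: the decoder throws, like the isValidUTF8 check.
        System.out.println("strict rejected: "
                + (decode(invalid, CodingErrorAction.REPORT) == null));

        // REPLACE: the decoder substitutes U+FFFD, matching new String(...).
        String lenient = decode(invalid, CodingErrorAction.REPLACE);
        System.out.println("lenient matches new String: "
                + new String(invalid, StandardCharsets.UTF_8).equals(lenient));

        // ISO-8859-1 maps each byte to one char, so the round trip is lossless.
        byte[] latin1 = new String(invalid, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 round trip: " + Arrays.equals(invalid, latin1));
    }
}
```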
However, I think I will go another way and encode my byte[] as a Base64 string, as I don't want more trouble with encoding.
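A minimal sketch of that approach, assuming Java 8 or later so that java.util.Base64 is available. The class Base64RoundTrip, its two helpers, and the sample bytes are hypothetical and only illustrate the idea.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {

    // Turns arbitrary bytes into an ASCII-only String.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Recovers the exact original bytes from the Base64 text.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Hypothetical sample containing a byte that is not valid UTF-8.
        byte[] original = {(byte) 'a', (byte) 0xE9, (byte) 'b'};

        String transported = encode(original);
        byte[] restored = decode(transported);

        // No charset is involved in the byte-to-text step, so nothing is replaced.
        System.out.println("round trip preserved bytes: "
                + Arrays.equals(original, restored));
    }
}
```

Because the Base64 alphabet is plain ASCII, the resulting String survives any charset the web service is likely to use, at the cost of roughly a third more data.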