
Converting string to byte[] returns wrong value (encoding?)

Tags:

java

I read a byte[] from a file and convert it to a String:

byte[] bytesFromFile = Files.readAllBytes(...);
String stringFromFile = new String(bytesFromFile, "UTF-8");

I want to compare this to another byte[] I get from a web service:

String stringFromWebService = webService.getMyByteString(); 
byte[] bytesFromWebService = stringFromWebService.getBytes("UTF-8");

So I read a byte[] from a file and convert it to a String and I get a String from my web service and convert it to a byte[]. Then I do the following tests:

// works!
org.junit.Assert.assertEquals(stringFromFile, stringFromWebService);

// fails!
org.junit.Assert.assertArrayEquals(bytesFromFile, bytesFromWebService);

Why does the second assertion fail?

Thomas Uhrig asked Mar 24 '15 16:03


2 Answers

Other answers have covered the likely cause: the file is probably not UTF-8 encoded, which gives rise to the symptoms described.

However, I think the most interesting aspect of this is not that the byte[] assert fails, but that the assert that the string values are the same passes. I'm not 100% sure why this is, but I think the following trawl through the source code might give us the answer:

  • Looking at how new String(bytesFromFile, "UTF-8") works, we see that the constructor calls through to StringCoding.decode()
  • This in turn, if supplied with the UTF-8 character set, calls through to StringDecoder.decode()
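  • This calls through to CharsetDecoder.decode(), which decides what to do if the input is malformed or a character is unmappable (bytes that are not valid UTF-8 count as malformed input)
  • A freshly created CharsetDecoder uses an action defined by

    private CodingErrorAction unmappableCharacterAction
        = CodingErrorAction.REPORT;

  • REPORT means the decoder signals the problem to its caller, and CharsetDecoder.decode(ByteBuffer) then throws a CharacterCodingException. StringCoding does not keep that default, though: it configures its decoder with CodingErrorAction.REPLACE for both malformed input and unmappable characters.

  • I think this means that when the code hits bytes it cannot decode, it substitutes the replacement character U+FFFD instead of failing. If the String from the web service was produced from the same bytes in the same way, both Strings contain the same replacement characters and compare equal. Encoding U+FFFD back to UTF-8, however, yields the bytes EF BF BD rather than the original ones, so the byte[] are no longer the same.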

This hypothesis is kind of supported by the fact that the catch block for CharacterCodingException in StringCoding.decode() says:

} catch (CharacterCodingException x) {
            // Substitution is always enabled,
            // so this shouldn't happen
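
The replacement behaviour can be checked with a small experiment. The following is only a minimal sketch, not the asker's code: the class name RoundTripDemo, the helper decodeLeniently and the sample bytes ("Müller" encoded as ISO-8859-1) are made up for illustration, and it assumes the file really contains non-UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {

    // Decodes the bytes as UTF-8 with the String constructor, as in the question.
    // Invalid sequences are replaced with U+FFFD instead of raising an exception.
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical file content: "Müller" in ISO-8859-1.
        // 0xFC is 'ü' in ISO-8859-1, but it can never occur in valid UTF-8.
        byte[] bytesFromFile = {0x4D, (byte) 0xFC, 0x6C, 0x6C, 0x65, 0x72};

        String stringFromFile = decodeLeniently(bytesFromFile);
        byte[] roundTripped = stringFromFile.getBytes(StandardCharsets.UTF_8);

        // Index of the replacement character in the decoded String.
        System.out.println(stringFromFile.indexOf('\uFFFD'));
        // The original bytes and the re-encoded bytes, side by side.
        System.out.println(Arrays.toString(bytesFromFile));
        System.out.println(Arrays.toString(roundTripped));
        // The round trip is lossy, so the arrays differ.
        System.out.println(Arrays.equals(bytesFromFile, roundTripped));
    }
}
```

The single byte 0xFC comes back as the three bytes EF BF BD, which is why assertArrayEquals can fail even though two Strings that both went through the same lossy decoding still compare equal.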

J Richard Snape answered Oct 01 '22 21:10


I don't understand it fully, but here's what I have so far:

The problem is that the data contains some bytes that are not valid UTF-8, which I confirmed with the following check:

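import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// returns false for my data!
public static boolean isValidUTF8(byte[] input) {
    // A fresh decoder reports malformed input instead of replacing it
    CharsetDecoder cs = StandardCharsets.UTF_8.newDecoder();
    try {
        cs.decode(ByteBuffer.wrap(input));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}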

When I change the encoding to ISO-8859-1, everything works fine. The strange thing (which I don't understand yet) is why my conversion (new String(bytesFromFile, "UTF-8")) doesn't throw any exception, the way the decoder in my isValidUTF8 method does, although the data is not valid UTF-8.
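
For what it's worth, the difference seems to come down to the decoder's error action: a decoder obtained from newDecoder() reports malformed input, while the String constructors replace it (the Javadoc for String(byte[], Charset) says so explicitly; for the charset-name variant the behaviour is formally unspecified). Here is a rough sketch of both observations. The names StrictDecodingDemo, decodeStrict and latin1RoundTrip and the sample bytes are hypothetical, chosen only for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StrictDecodingDemo {

    // Hypothetical helper: decodes UTF-8 and throws on invalid bytes.
    // REPORT is already the default for a new decoder; it is set explicitly for clarity.
    static String decodeStrict(byte[] input) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(input))
                .toString();
    }

    // ISO-8859-1 maps every byte value 0x00..0xFF to exactly one char,
    // so decoding and re-encoding returns the original bytes.
    static byte[] latin1RoundTrip(byte[] input) {
        String text = new String(input, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical data: "Müller" in ISO-8859-1, which is not valid UTF-8.
        byte[] data = {0x4D, (byte) 0xFC, 0x6C, 0x6C, 0x65, 0x72};

        try {
            System.out.println(decodeStrict(data));
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }

        System.out.println(Arrays.equals(data, latin1RoundTrip(data)));
    }
}
```

This would also explain why switching to ISO-8859-1 makes the byte[] comparison pass: no byte value is invalid in that charset, so nothing gets replaced along the way.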

However, I think I will go another way and encode my byte[] as a Base64 string, as I don't want more trouble with encodings.
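
A minimal sketch of that Base64 approach, assuming Java 8 or later (java.util.Base64). The class and method names (Base64TransportDemo, toText, fromText) and the sample bytes are invented for this example and are not part of the web service.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64TransportDemo {

    // Turns arbitrary bytes into a plain ASCII string that is safe to send as text.
    static String toText(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Recovers the exact original bytes from the Base64 text.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Hypothetical file content that is not valid UTF-8.
        byte[] bytesFromFile = {0x4D, (byte) 0xFC, 0x6C, 0x6C, 0x65, 0x72};

        String transported = toText(bytesFromFile);
        byte[] bytesFromWebService = fromText(transported);

        System.out.println(transported);
        System.out.println(Arrays.equals(bytesFromFile, bytesFromWebService));
    }
}
```

Because the bytes are never interpreted as characters in any charset, the comparison no longer depends on what encoding the file happens to use.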

Thomas Uhrig answered Oct 01 '22 19:10