My problem is as follows: I am reading an XML file whose text nodes partially contain the UTF-8 encodings of opening and closing double quotation marks. The text is extracted, truncated to 3999 bytes, and put into a new XML format, which is then saved to a file.
While both characters are displayed correctly by Notepad++ in the input file, the output file contains invalid UTF-8 sequences that not even Notepad++ can display.
The opening double quotes are written correctly, but the closing ones are mangled.
Using a hex editor, I found out that the code units are somehow changed from
E2 80 9D
in the input file to
E2 80 3F
in the output file. I am using the SAX parser for the XML parsing.
Are there any known bugs that could cause such a behaviour?
This is not a known bug, but a common mistake: leaving the encoding out when reading or writing files, so the platform default encoding is used instead, which is Windows-1252 in this case.
When you initially read the file you should specify UTF-8 for decoding, and when writing the new file you should specify UTF-8 for encoding. If you post your implementation, I can correct it in place.
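Since the implementation wasn't posted, here is a minimal sketch of what an explicit-encoding read/write could look like (the class name, method name, and file names are my own placeholders, not from the question):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8Copy {
    // Copy a text file, decoding and re-encoding explicitly as UTF-8,
    // so the platform default (Windows-1252 here) is never consulted.
    public static void copyUtf8(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                writer.write(c);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder file names for illustration only.
        copyUtf8(Paths.get("input.xml"), Paths.get("output.xml"));
    }
}
```

With both charsets pinned to UTF-8, the E2 80 9D sequence survives the round trip unchanged.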
How this can be reproduced:
// Decode UTF-8 bytes with the wrong charset, then encode them back.
Charset windows1252 = Charset.forName("windows-1252");
byte[] quoteUtf8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D};
String decodedPlatformDefault = new String(quoteUtf8, windows1252);
byte[] encodedPlatformDefault = decodedPlatformDefault.getBytes(windows1252);
for (byte b : encodedPlatformDefault) {
    System.out.print(String.format("%02x ", b));
    // e2 80 3f
}
E2 80 9D is a valid byte sequence for UTF-8, decoding to '”' = U+201D.
You can see this because all three bytes have their high bit set. This is a deliberate safety property of UTF-8: no ASCII character, such as '/', can accidentally appear inside a multi-byte sequence.
In the second sequence, 3F ('?') has no high bit set, so it cannot belong to a multi-byte sequence. This means something went wrong during reading or conversion: an unmappable character was replaced with a question mark, for instance by decoding and re-encoding with the wrong charset. Note in particular that 9D lies in the range (80 to 9F) that Windows-1252, aka Cp1252, redefines relative to Latin-1.
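The bit-level argument above can be checked directly in a few lines (an illustration I've added, not part of the original answer):

```java
import java.nio.charset.StandardCharsets;

public class HighBitDemo {
    public static void main(String[] args) {
        byte[] valid = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D};
        byte[] broken = {(byte) 0xE2, (byte) 0x80, (byte) 0x3F};

        // Every byte of a UTF-8 multi-byte sequence has its high bit set.
        for (byte b : valid) {
            System.out.printf("%02x high bit set: %b%n", b, (b & 0x80) != 0);
        }

        // The sequence therefore decodes cleanly to the closing quote U+201D.
        String quote = new String(valid, StandardCharsets.UTF_8);
        System.out.println(quote.equals("\u201D")); // true

        // 3F has the high bit clear: it is plain ASCII '?', so it cannot be
        // a continuation byte, and the decoder cannot produce a quote from it.
        System.out.println((broken[2] & 0x80) == 0 && broken[2] == '?'); // true
    }
}
```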