
Unknown UTF-8 code units closing double quotes

My problem is as follows. I am reading an XML file in which some text nodes contain the UTF-8 encoded opening and closing double quotes. The text is extracted, shortened to 3999 bytes and put into a new XML format, which is then saved as a file.

While both characters are displayed correctly by Notepad++ in the input file, the output file contains invalid UTF-8 characters that not even Notepad++ is able to display.

The opening double quotes are printed correctly, but the closing ones are disfigured.

Using a hex editor, I found out that the code units are somehow changed from

E2 80 9D

in the input file to

E2 80 3F

in the output file. I am using the SAX parser for the XML parsing.

Are there any known bugs that could cause such a behaviour?

LuigiEdlCarno asked Mar 25 '26 03:03


2 Answers

This is not a known bug but a common mistake: leaving out the encoding when reading or writing files, so that the platform default encoding is used, which is Windows-1252 in this case.

When you initially read the file, you should specify UTF-8 decoding, and when writing to the new file, you should specify UTF-8 encoding. If you post your implementation, I can correct it in place.
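Since the original implementation was not posted, here is only a minimal sketch of what specifying the encoding on both sides could look like. It assumes the text goes through `java.nio.file.Files`; the class name `Utf8Copy` and the method `copyAsUtf8` are hypothetical names chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Copy {
    // Hypothetical helper: copies text from one file to another, decoding and
    // encoding as UTF-8 explicitly so the platform default charset is never used.
    static void copyAsUtf8(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }
}
```

The same idea applies to other I/O classes: wherever a constructor or method has an overload that takes a charset (for example `InputStreamReader`, `OutputStreamWriter`, `String.getBytes`), pass `StandardCharsets.UTF_8` instead of relying on the default.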

How this can be reproduced:

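import java.nio.charset.Charset;

public class QuoteRepro {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] quoteutf8 = {(byte)0xE2, (byte)0x80, (byte)0x9D}; // UTF-8 for U+201D
        // 0x9D is unassigned in windows-1252 and decodes to U+FFFD
        String decodedPlatformDefault = new String(quoteutf8, cp1252);
        // U+FFFD has no windows-1252 mapping, so it is encoded as '?' (0x3F)
        byte[] encodedPlatformDefault = decodedPlatformDefault.getBytes(cp1252);
        for (byte b : encodedPlatformDefault) {
            System.out.print(String.format("%02x ", b));
        }
        // prints: e2 80 3f
    }
}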
Esailija answered Mar 27 '26 17:03


E2 80 9D is a valid UTF-8 byte sequence that encodes '”' (\u201d). You can tell because every byte has its high bit set. This is a useful safety property of UTF-8: no byte of a multi-byte sequence can be mistaken for an ASCII character such as '/'.
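The high-bit observation can be checked with a few lines of Java. This is only an illustrative sketch; the class `HighBitCheck` and the method `allHighBitsSet` are hypothetical names, and the second byte array is the output sequence reported in the question.

```java
import java.nio.charset.StandardCharsets;

public class HighBitCheck {
    // Returns true if every byte has its most significant bit set,
    // i.e. none of the bytes lies in the ASCII range 0x00-0x7F.
    static boolean allHighBitsSet(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] input = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D};
        byte[] output = {(byte) 0xE2, (byte) 0x80, (byte) 0x3F};
        System.out.println(allHighBitsSet(input));   // true
        System.out.println(allHighBitsSet(output));  // false
        // Decoding the input bytes as UTF-8 yields the single code point U+201D.
        String decoded = new String(input, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(decoded.codePointAt(0))); // 201d
    }
}
```

Note that "all high bits set" is a necessary condition for a multi-byte sequence, not a full validity check; it is enough here to show where the two sequences differ.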

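In the second sequence, 3F ('?') does not have its high bit set, so the sequence is no longer valid UTF-8. A question mark is the usual replacement for a character that could not be converted, which suggests that a conversion went wrong somewhere, for example by converting twice through the wrong charset. Notably, 9D lies in the range 80 - 9F, which is where Windows-1252 (Cp1252) extends Latin-1.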
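To confirm that a byte sequence is broken rather than guessing from a hex dump, one option is a strict decoder that reports errors instead of silently replacing them. The following is a sketch under that assumption; `Utf8Validator` and `isValidUtf8` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Validator {
    // Returns true if the bytes form well-formed UTF-8. The decoder is
    // configured to throw on bad input instead of substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Sequence from the input file: a complete three-byte character.
        System.out.println(isValidUtf8(new byte[]{(byte) 0xE2, (byte) 0x80, (byte) 0x9D})); // true
        // Sequence from the output file: 0x3F is not a continuation byte.
        System.out.println(isValidUtf8(new byte[]{(byte) 0xE2, (byte) 0x80, (byte) 0x3F})); // false
    }
}
```

A check like this could be run on the output file in a test, so that an accidental fallback to the platform default encoding fails loudly instead of producing '?' bytes.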

Joop Eggen answered Mar 27 '26 18:03


