My problem is as follows: I am reading an XML file whose text nodes partially contain the UTF-8 encodings of opening and closing double quotation marks. The text is extracted, truncated to 3999 bytes, and put into a new XML format, which is then saved to a file.
While both characters are displayed correctly by Notepad++ in the input file, the output file contains invalid UTF-8 sequences that not even Notepad++ can display.
The opening double quotes are written correctly, but the closing ones are mangled.
Using a hex editor, I found out that the code units are somehow changed from
E2 80 9D
in the input file to
E2 80 3F
in the output file. I am using the SAX parser for the XML parsing.
Are there any known bugs that could cause such a behaviour?
This is not a known bug, but a common mistake: leaving the encoding out when reading or writing files, so the platform default encoding is used instead, which is Windows-1252 in this case.
When you initially read the file you should specify UTF-8 for decoding, and when writing the new file you should specify UTF-8 for encoding. If you post your implementation, I can correct it in place.
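Since the implementation wasn't posted, here is a minimal sketch of what an explicit-encoding read/write could look like (the class name, method name, and file names are my own placeholders, not from the question):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8Copy {
    // Copy a text file, decoding and re-encoding explicitly as UTF-8,
    // so the platform default (Windows-1252 here) is never consulted.
    public static void copyUtf8(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                writer.write(c);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder file names for illustration only.
        copyUtf8(Paths.get("input.xml"), Paths.get("output.xml"));
    }
}
```

With both charsets pinned to UTF-8, the E2 80 9D sequence survives the round trip unchanged.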
How this can be reproduced:
// Decode UTF-8 bytes with the wrong charset, then encode them back.
Charset windows1252 = Charset.forName("windows-1252");
byte[] quoteUtf8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D};
String decodedPlatformDefault = new String(quoteUtf8, windows1252);
byte[] encodedPlatformDefault = decodedPlatformDefault.getBytes(windows1252);
for (byte b : encodedPlatformDefault) {
    System.out.print(String.format("%02x ", b));
    // e2 80 3f
}
E2 80 9D is a valid byte sequence for UTF-8, decoding to '”' = U+201D.
You can see this because all three bytes have their high bit set. This is a deliberate safety property of UTF-8: no ASCII character, such as '/', can accidentally appear inside a multi-byte sequence.
In the second sequence, 3F ('?') has no high bit set, so it cannot belong to a multi-byte sequence. This means something went wrong during reading or conversion: an unmappable character was replaced with a question mark, for instance by decoding and re-encoding with the wrong charset. Note in particular that 9D lies in the range (80 to 9F) that Windows-1252, aka Cp1252, redefines relative to Latin-1.
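The bit-level argument above can be checked directly in a few lines (an illustration I've added, not part of the original answer):

```java
import java.nio.charset.StandardCharsets;

public class HighBitDemo {
    public static void main(String[] args) {
        byte[] valid = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D};
        byte[] broken = {(byte) 0xE2, (byte) 0x80, (byte) 0x3F};

        // Every byte of a UTF-8 multi-byte sequence has its high bit set.
        for (byte b : valid) {
            System.out.printf("%02x high bit set: %b%n", b, (b & 0x80) != 0);
        }

        // The sequence therefore decodes cleanly to the closing quote U+201D.
        String quote = new String(valid, StandardCharsets.UTF_8);
        System.out.println(quote.equals("\u201D")); // true

        // 3F has the high bit clear: it is plain ASCII '?', so it cannot be
        // a continuation byte, and the decoder cannot produce a quote from it.
        System.out.println((broken[2] & 0x80) == 0 && broken[2] == '?'); // true
    }
}
```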