
What causes my XML to break?

I have the following XML code.

<firstname>
 <default length="6">Örwin</default>
 <short>Örwin</short>
 <shorter>Örwin</shorter>
 <shortest>�.</shortest>
</firstname>

Why does the content of the "shortest" node break? It should be a simple "Ö" instead of the tedious �. The XML is UTF-8 encoded, and the function that writes that node also writes the content of "short" and "shorter", where the "Ö" is clearly visible.

individual8 asked Dec 09 '22 20:12


1 Answer

My guess is that the XML isn't properly UTF-8 encoded. Please show the bytes within the <shortest> element in the raw file... I suspect you'll find they're not a validly encoded character. If you could show a short but complete program which generates this XML from valid input, that would be very helpful. (Preferably saying which platform it is, too :)
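If it helps, here is a rough Python sketch of one way to pull out those raw bytes. The helper name `raw_element_bytes` and the inline `sample` are made up for illustration; on a real file you would read it with `open(path, "rb")` so that nothing gets decoded along the way.

```python
import re


def raw_element_bytes(data: bytes, tag: str) -> bytes:
    """Return the undecoded bytes between <tag ...> and </tag> (first match)."""
    name = re.escape(tag.encode("ascii"))
    pattern = re.compile(rb"<" + name + rb"(?:\s[^>]*)?>(.*?)</" + name + rb">", re.DOTALL)
    match = pattern.search(data)
    if match is None:
        raise ValueError(f"no <{tag}> element found")
    return match.group(1)


# Hypothetical stand-in for the bytes of the real file.
sample = "<firstname><short>Örwin</short><shorter>Örwin</shorter></firstname>".encode("utf-8")

# Print the bytes as space-separated hex so they can be pasted into a question.
print(raw_element_bytes(sample, "short").hex(" ").upper())
```

Working on bytes rather than on a decoded string matters here: any text-mode read could already replace or reinterpret the bytes you are trying to inspect.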

EDIT: Something very odd is going on in this file. Here are the hex values for the "shorter" and "shortest" values:

Shorter: C3 96 72 77 69 6E

Shortest: EF BF BD 2E

Now "C3 96" is the valid UTF-8 encoding for U+00D6 which is "Latin capital letter O with diaeresis" as you want.

However, EF BF BD is the UTF-8 encoding for U+FFFD which is "replacement character" - definitely not what you want. (The 2E is just the ASCII dot.)
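You can check both readings yourself. This is a small Python sketch (the `describe` helper is just an illustrative name) that strictly decodes the hex bytes quoted above as UTF-8 and looks up each character's Unicode name:

```python
import unicodedata


def describe(hex_bytes):
    """Strictly decode space-separated hex as UTF-8 and name each code point."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")  # raises on invalid UTF-8
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe("C3 96"))
print(describe("EF BF BD 2E"))
```

Both calls decode without raising, which is the point: the bytes are well-formed UTF-8, they just spell a different character than the one intended.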

So, this is actually valid UTF-8 - but it doesn't contain the characters you want. Again, you should examine what created the file...
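I can't see the generating code, so the following is only a guess at one common way to end up with these exact bytes: the name is cut to its first byte instead of its first character, and something later decodes the leftover half-sequence leniently. A Python sketch of that assumed scenario:

```python
name = "Örwin"
encoded = name.encode("utf-8")  # "Ö" takes two bytes: C3 96

# Assumed bug: truncate by bytes, which splits the two-byte sequence for "Ö".
first_byte = encoded[:1]

# A lenient decoder substitutes U+FFFD for the incomplete sequence.
broken = first_byte.decode("utf-8", errors="replace") + "."

# Truncating by characters keeps the "Ö" intact.
fixed = name[:1] + "."

print(broken.encode("utf-8").hex(" ").upper())
print(fixed.encode("utf-8").hex(" ").upper())
```

If the first line printed matches what is in your file, a byte-oriented substring call in whatever produces "shortest" would be the first place to look. Other causes are possible, for example a wrong source encoding somewhere earlier in the pipeline.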

Jon Skeet answered Jan 03 '23 01:01