I have encountered a web service that is returning an HTTP Content-Type header with a UTF-8 charset:
Content-Type: text/xml;charset=UTF-8
...and an XML declaration encoding attribute whose value is ISO-8859-1 (aka latin1):
<?xml version='1.0' encoding="ISO-8859-1" standalone="no" ?>
When I attempt to display a response from this web service in Firefox, it displays XML Parsing Error: not well-formed when it encounters an á (small letter a with acute).
The fact that Firefox issues this parsing error doesn't come as a surprise to me. I want to say that an XML encoding that is not equivalent to the HTTP character set is never correct. Am I right? Should such a situation always be considered a web server configuration problem?
You have text/xml and a UTF-8 charset. In that case, section 8.1 "Text/xml with UTF-8 Charset" of RFC 3023 applies:
<?xml version="1.0" encoding="utf-8"?>
This is the recommended charset value for use with text/xml. Since the charset parameter is provided, MIME and XML processors MUST treat the enclosed entity as UTF-8 encoded.
Unfortunately, this only covers the case where the XML encoding is also utf-8, which you don't have here.
However, there is one more section, 8.20 "Inconsistent Example: Text/xml with UTF-8 Charset", which exactly mentions the case you have:
Content-type: text/xml; charset="utf-8"
<?xml version="1.0" encoding="iso-8859-1"?>
Since the charset parameter is provided in the Content-Type header, MIME and XML processors MUST treat the enclosed entity as UTF-8 encoded. That is, the "iso-8859-1" encoding MUST be ignored.
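To spot this kind of mismatch in a response, something like the following minimal sketch could be used. It only covers the case discussed here (text/xml with an explicit charset parameter, where the header value takes precedence). The function names `charset_from_content_type`, `declared_xml_encoding`, and `check_response` are made up for this illustration, and the regular expression is a rough approximation of an XML declaration, not a full parser.

```python
import re
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, lowercased, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")  # surrounding quotes are removed by default
    return charset.lower() if charset else None


def declared_xml_encoding(body):
    """Return the encoding named in the XML declaration, lowercased, or None.

    Assumption: the body starts with an ASCII-compatible XML declaration.
    """
    match = re.match(
        rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']""", body
    )
    return match.group(1).decode("ascii").lower() if match else None


def check_response(content_type, body):
    """Return (header charset, declared encoding, whether the two disagree)."""
    header = charset_from_content_type(content_type)
    declared = declared_xml_encoding(body)
    mismatch = header is not None and declared is not None and header != declared
    return header, declared, mismatch


# The header and XML declaration from the question
result = check_response(
    "text/xml;charset=UTF-8",
    b"<?xml version='1.0' encoding=\"ISO-8859-1\" standalone=\"no\" ?><r/>",
)
print(result)
```

When the third element is true, the header charset is the one a conforming processor will use, so the body bytes need to be valid in that encoding.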
Now, your document probably (you should verify with a hex editor) contains á in ISO-8859-1 form, which is the single byte 0xE1 (hex). Since the ISO encoding is ignored and UTF-8 applies, this should be the two bytes 0xC3 0xA1 instead.
In UTF-8, 0xE1 is not a character by itself. Instead, it is the first byte of a 3-byte sequence covering the Unicode range U+1000 to U+1FFF. To know what it would decode to, we would need to know the 2 bytes that follow it. It is quite likely that the á is followed by a "normal" character from the ASCII set. That makes the sequence invalid, since the 2 bytes that follow the 0xE1 must be continuation bytes in the range 0x80 to 0xBF, and ASCII characters are all below 0x80. Thus an encoding error occurs.
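The byte-level behaviour can be reproduced in a few lines of Python. This is only an illustration: the sample word "Málaga" and the helper name `try_utf8` are made up here, and the actual response body may of course look different.

```python
char = "á"
latin1_bytes = char.encode("iso-8859-1")  # single byte 0xE1
utf8_bytes = char.encode("utf-8")         # two bytes 0xC3 0xA1
print(latin1_bytes.hex(), utf8_bytes.hex())


def try_utf8(data):
    """Decode bytes as UTF-8, or describe where decoding fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error at byte {exc.start}: {exc.reason}"


# ISO-8859-1 bytes read as UTF-8: 0xE1 is followed by ASCII "l", so decoding fails
broken = try_utf8("Málaga".encode("iso-8859-1"))
print(broken)

# The same text encoded as UTF-8 decodes cleanly
ok = try_utf8("Málaga".encode("utf-8"))
print(ok)

# 0xE1 followed by two continuation bytes is valid and lands in U+1000..U+1FFF
lowest = b"\xe1\x80\x80".decode("utf-8")
highest = b"\xe1\xbf\xbf".decode("utf-8")
print(hex(ord(lowest)), hex(ord(highest)))
```

The failing case is the same condition Firefox reports as "not well-formed".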
I want to say that an XML encoding that is not equivalent to the HTTP character set is never correct. Am I right?
Well, it's at least not recommended, and you need to know RFC 3023 in detail to predict what happens in such a case. It's much easier if the content type charset and the XML encoding match.
Should such a situation always be considered a web server configuration problem?
No. It could also be an implementation issue, e.g. the programmer has set the content type and encoding in the application itself, and you can't do much about it in the web server configuration.