
Mismatch between HTTP 'charset' and XML 'encoding'

I have encountered a web service that is returning an HTTP Content-Type header with a UTF-8 charset:

Content-Type: text/xml;charset=UTF-8

...and an XML declaration encoding attribute whose value is ISO-8859-1 (aka, latin1):

<?xml version='1.0' encoding="ISO-8859-1" standalone="no" ?>

When I attempt to display a response from this web service in Firefox, it displays XML Parsing Error: not well-formed when it encounters an á (small letter a with acute).

The fact that Firefox issues this parsing error doesn't come as a surprise to me. I want to say that an XML encoding that is not equivalent to the HTTP character set is never correct. Am I right? Should such a situation always be considered a web server configuration problem?

asked Oct 31 '14 by DavidRR

1 Answer

The problem

You have text/xml with a UTF-8 charset. In that case, section 8.1 "Text/xml with UTF-8 Charset" of RFC 3023 applies:

<?xml version="1.0" encoding="utf-8"?>

This is the recommended charset value for use with text/xml. Since the charset parameter is provided, MIME and XML processors MUST treat the enclosed entity as UTF-8 encoded.

Unfortunately this only defines the case where the XML encoding is also utf-8, which you don't have here.

However, there is another section, 8.20 "Inconsistent Example: Text/xml with UTF-8 Charset", which covers exactly the case you have:

Content-type: text/xml; charset="utf-8"

<?xml version="1.0" encoding="iso-8859-1"?>

Since the charset parameter is provided in the Content-Type header, MIME and XML processors MUST treat the enclosed entity as UTF-8 encoded. That is, the "iso-8859-1" encoding MUST be ignored.
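In practice, this means a client must take the charset from the Content-Type header and use it to decode the body, without looking at the XML declaration. A minimal sketch in Python (the function name `charset_from_content_type` is made up for this illustration):

```python
from email.message import Message

def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type header value.

    (A helper invented for this sketch; it reuses the stdlib's
    MIME header parsing rather than splitting the string by hand.)
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_param("charset")  # None if there is no charset parameter

# The header from the question: this charset wins, so the body
# must be decoded as UTF-8 before it is handed to an XML parser.
print(charset_from_content_type("text/xml;charset=UTF-8"))  # UTF-8
```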

Now, your document probably (you should verify with a hex editor) contains á in its ISO-8859-1 form, which is the single byte 0xE1. Since the ISO-8859-1 encoding is ignored and UTF-8 applies, it would need to be the two bytes 0xC3 0xA1 instead.
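You can confirm both byte representations with a quick check in Python:

```python
# "á" is U+00E1. In ISO-8859-1 it is a single byte; in UTF-8 it is two.
assert "á".encode("iso-8859-1") == b"\xe1"
assert "á".encode("utf-8") == b"\xc3\xa1"
```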

In UTF-8, 0xE1 is not a character by itself. Instead, it is the lead byte of a 3-byte sequence covering the Unicode range U+1000 to U+1FFF. To know what it would decode to, we would need to know the next 2 bytes that follow the á. It is quite likely that it is followed by a "normal" character from the ASCII set. That makes the sequence invalid, since the 2 bytes following 0xE1 must be continuation bytes in the range 0x80 to 0xBF - thus a decoding error occurs.
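This failure is easy to reproduce. Assuming the á is followed by an ASCII letter (here "s", so the ISO-8859-1 bytes of "ás"), a strict UTF-8 decoder rejects it for exactly this reason:

```python
raw = b"\xe1s"  # "ás" encoded as ISO-8859-1
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    # 0x73 ("s") is not in the continuation range 0x80-0xBF
    print(e.reason)  # invalid continuation byte
```

An XML parser such as Firefox's behaves like this strict decoder, which is why you see the "not well-formed" error at the á.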

Your questions

I want to say that an XML encoding that is not equivalent to the HTTP character set is never correct. Am I right?

Well, it's at least not recommended, and you need to know RFC 3023 in detail to work out what happens in such a case. Things are much simpler when the HTTP charset and the XML encoding declaration match.

Should such a situation always be considered a web server configuration problem?

No. It could also be an implementation issue, e.g. the programmer has hard-coded the content type and encoding in the application, in which case you can't do much about it in the web server configuration.

answered Sep 18 '22 by Thomas Weller