In HTTP you can specify in a request that your client accepts specific content in responses using the Accept header, with values such as application/xml. The content type specification allows you to include parameters, such as charset=utf-8, indicating that you can accept that content in a specified character set. There is also the Accept-Charset header, which specifies the character encodings accepted by the client.
If both headers are specified and the Accept header contains content types with a charset parameter, which should the server consider the authoritative header?
e.g.:
Accept: application/xml; q=1, text/plain; charset=ISO-8859-1; q=0.8
Accept-Charset: UTF-8
I've sent a few example requests to various servers using Fiddler to test how they respond:
Examples
W3
Request
GET http://www.w3.org/ HTTP/1.1
Host: www.w3.org
Accept: text/html;charset=UTF-8
Accept-Charset: ISO-8859-1
Response
Content-Type: text/html; charset=utf-8
Google
Request
GET http://www.google.co.uk/ HTTP/1.1
Host: www.google.co.uk
Accept: text/html;charset=UTF-8
Accept-Charset: ISO-8859-1
Response
Content-Type: text/html; charset=ISO-8859-1
StackOverflow
Request
GET http://stackoverflow.com/ HTTP/1.1
Host: stackoverflow.com
Accept: text/html;charset=UTF-8
Accept-Charset: ISO-8859-1
Response
Content-Type: text/html; charset=utf-8
Microsoft
Request
GET http://www.microsoft.com/ HTTP/1.1
Host: www.microsoft.com
Accept: text/html;charset=UTF-8
Accept-Charset: ISO-8859-1
Response
Content-Type: text/html
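For anyone who wants to reproduce these tests without Fiddler, here is a minimal Python sketch (assuming the third-party requests library is installed); the URLs and headers simply mirror the examples above.

import requests

# The same conflicting headers used in the tests above.
headers = {
    "Accept": "text/html;charset=UTF-8",
    "Accept-Charset": "ISO-8859-1",
}

for url in ("http://www.w3.org/",
            "http://www.google.co.uk/",
            "http://stackoverflow.com/",
            "http://www.microsoft.com/"):
    response = requests.get(url, headers=headers)
    # Print whichever Content-Type (and charset) the server chose.
    print(url, "->", response.headers.get("Content-Type"))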
There doesn't seem to be any consensus around what the expected behaviour is. I am trying to look surprised.
Although you can set a media type in the Accept header, the charset parameter for that media type is not defined anywhere in RFC 2616 (it is not forbidden either).
Therefore, if you are going to implement an HTTP/1.1-compliant server, you should first look at the Accept-Charset header, and then look for any charset parameters in the Accept header.
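A rough sketch of that ordering on the server side might look like the following (plain Python, hypothetical function and variable names, q-values ignored for brevity): Accept-Charset is consulted first, and the server only falls back to charset parameters found in Accept.

# Hypothetical sketch of the ordering the answer suggests.
SUPPORTED_CHARSETS = ["utf-8", "iso-8859-1"]  # what this server can emit

def pick_charset(accept_header, accept_charset_header):
    # 1. Accept-Charset wins when present.
    if accept_charset_header:
        for candidate in accept_charset_header.split(","):
            charset = candidate.split(";")[0].strip().lower()
            if charset == "*":
                return SUPPORTED_CHARSETS[0]
            if charset in SUPPORTED_CHARSETS:
                return charset
    # 2. Otherwise look for charset parameters inside the Accept header.
    if accept_header:
        for media_range in accept_header.split(","):
            for param in media_range.split(";")[1:]:
                name, _, value = param.partition("=")
                if name.strip().lower() == "charset":
                    if value.strip().lower() in SUPPORTED_CHARSETS:
                        return value.strip().lower()
    # 3. Fall back to the server's default charset.
    return SUPPORTED_CHARSETS[0]

print(pick_charset("text/html;charset=UTF-8", "ISO-8859-1"))  # -> iso-8859-1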
Read RFC 2616 Sections 14.1 and 14.2. The Accept header does not allow you to specify a charset. You have to use the Accept-Charset header instead.
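As a side note, Accept-Charset entries can carry q-values just like Accept. A small parsing sketch in plain Python (no validation, hypothetical helper name), using the example header from RFC 2616 section 14.2:

# Rank the charsets in an Accept-Charset header by q-value,
# e.g. "iso-8859-5, unicode-1-1;q=0.8" (example from RFC 2616 14.2).
def ranked_charsets(accept_charset_header):
    ranked = []
    for entry in accept_charset_header.split(","):
        parts = [p.strip() for p in entry.split(";")]
        charset = parts[0].lower()
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        ranked.append((q, charset))
    # Highest quality first; q=0 means "not acceptable" and is dropped.
    return [c for q, c in sorted(ranked, reverse=True) if q > 0]

print(ranked_charsets("iso-8859-5, unicode-1-1;q=0.8"))
# -> ['iso-8859-5', 'unicode-1-1']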
Firstly, Accept headers can accept parameters, see RFC 7231 section 5.3.2. All text/* MIME types can accept a charset parameter.
The Accept-Charset header allows a user-agent to specify the charsets it supports. If the Accept-Charset header did not exist, a user-agent would have to specify a charset parameter on each text/* media type it accepted, e.g.
Accept: text/html;charset=US-ASCII, text/html;charset=UTF-8, text/plain;charset=US-ASCII, text/plain;charset=UTF-8
RFC 7231 section 5.3.2 (Accept) clearly states:
Each media-range might be followed by zero or more applicable media type parameters (e.g., charset)
So a charset parameter for each content type is allowed. In theory a client could accept, for example, text/html only in UTF-8 and text/plain only in US-ASCII.
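In header form, that theoretical request might look like:
Accept: text/html;charset=UTF-8, text/plain;charset=US-ASCII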
But it would usually make more sense to state possible charsets in the Accept-Charset header, as that applies to all types mentioned in the Accept header.
If those headers' charsets don't overlap, the server could send status 406 Not Acceptable.
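A simplified sketch of that check in plain Python (hypothetical names; it pools charsets from both headers rather than doing per-media-type cross-matching, and ignores q-values):

# Respond 406 when the charsets the client accepts (from either header)
# have no overlap with what the server can produce.
SERVER_CHARSETS = {"utf-8"}

def acceptable_charsets(accept_header, accept_charset_header):
    wanted = set()
    if accept_charset_header:
        wanted.update(c.split(";")[0].strip().lower()
                      for c in accept_charset_header.split(","))
    if accept_header:
        for media_range in accept_header.split(","):
            for param in media_range.split(";")[1:]:
                name, _, value = param.partition("=")
                if name.strip().lower() == "charset":
                    wanted.add(value.strip().lower())
    return wanted

def negotiate(accept_header, accept_charset_header):
    wanted = acceptable_charsets(accept_header, accept_charset_header)
    if wanted and not (wanted & SERVER_CHARSETS) and "*" not in wanted:
        return 406  # Not Acceptable: no common charset
    return 200

print(negotiate("text/plain;charset=US-ASCII", "ISO-8859-1"))  # -> 406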
However, I wouldn't expect fancy cross-matching from a server, for various reasons. It would make the server code more complicated (and therefore more error-prone), while in practice a client would rarely send such requests. Also, nowadays I would expect everything server-side to use UTF-8 and be sent as-is, so there's nothing to negotiate.