The HTML 5 specification describes an algorithm for selecting the character encoding to be used in a multi-part form submission (e.g. UTF-8). However, it is not clear how the selected encoding should be relayed to the server so that the content can be properly decoded on the receiving end.
Often, character encodings are represented by appending a "charset" parameter to the value of the Content-Type request header. However, this parameter does not appear to be defined for the multipart/form-data MIME type:
https://www.rfc-editor.org/rfc/rfc7578#section-8
Each part in a multipart form submission may provide its own Content-Type header; however, RFC 7578 notes that "in practice, many widely deployed implementations do not supply a charset parameter in each part, but rather, they rely on the notion of a 'default charset' for a multipart/form-data instance".
RFC 7578 goes on to suggest that a hidden "_charset_" form field can be used for this purpose. However, neither Safari (9.1) nor Chrome (51) appears to populate this field, and neither provides any per-part encoding information.
I've looked at the request headers produced by both browsers and I don't see any obvious character encoding information. Does anyone know how the browsers are conveying this information to the server?
multipart/form-data is one of the most commonly used enctypes (content types). In a multipart body, each field is sent as its own part, with its content type, file name, and data separated from the other fields by a boundary string. Because the boundary is unique, the data needs no additional encoding; binary data is sent as is.
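As a rough illustration of the layout described above, here is a minimal Python sketch that assembles such a body by hand. The function name `build_multipart` and its parameters are hypothetical, and it assumes field names and file names contain no quotes or line breaks (a real client would have to escape those). Following the HTML 5 rule quoted later in this thread, non-file parts get no Content-Type header.

```python
import uuid


def build_multipart(fields, files, charset="utf-8", boundary=None):
    """Assemble a multipart/form-data body as bytes (illustrative only).

    fields: dict of name -> str value (non-file fields)
    files:  dict of name -> (filename, content_bytes, content_type)
    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = []

    # Non-file fields: name and value are encoded in the chosen charset,
    # and no per-part Content-Type header is emitted.
    for name, value in fields.items():
        lines.append(delimiter)
        lines.append(
            f'Content-Disposition: form-data; name="{name}"'.encode(charset)
        )
        lines.append(b"")
        lines.append(value.encode(charset))

    # File fields: the raw bytes are copied through untouched.
    for name, (filename, content, content_type) in files.items():
        lines.append(delimiter)
        lines.append(
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"'.encode(charset)
        )
        lines.append(f"Content-Type: {content_type}".encode("ascii"))
        lines.append(b"")
        lines.append(content)

    # The closing delimiter carries two extra trailing hyphens.
    lines.append(delimiter + b"--")
    lines.append(b"")
    body = b"\r\n".join(lines)
    return body, f"multipart/form-data; boundary={boundary}"
```

Note that the only parameter on the outer Content-Type value is the boundary; nothing in the body says which charset produced the bytes of the text field, which is exactly the gap the question is about.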
enctype='multipart/form-data' is an encoding type that allows files to be sent through a POST. Quite simply, without this encoding the files cannot be sent through POST. If you want to allow a user to upload a file via a form, you must use this enctype.
HTTP/1.1, as originally specified in RFC 2616, says that the default charset for text media types is ISO-8859-1. But there are too many unlabeled documents in other encodings, so browsers use the reader's preferred encoding when there is no explicit charset parameter.
An HTTP multipart request is a request that clients construct to send files and data to a server. Browsers and other HTTP clients commonly use it to upload files.
HTML 5 uses RFC 2388 (obsoleted by RFC 7578); however, HTML 5 explicitly removes the Content-Type header from non-file fields, while the RFCs do not:
The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified. Their names and values must be encoded using the character encoding selected above (field names in particular do not get converted to a 7-bit safe encoding as suggested in RFC 2388).
The RFCs are designed to allow multipart/form-data to be usable in other contexts besides just HTML (though that is its most common use). In those other contexts, Content-Type is allowed. It is disallowed only in HTML 5 (HTML 4 allows it).
Without a Content-Type header, the hidden _charset_ form field, if present, is the only way an HTML 5 <form> submitter can explicitly state which charset is used.
Per the HTML 5 algorithm spec that you linked to, the chosen charset MUST be selected from the <form> element's accept-charset attribute if present, otherwise be the charset used by the HTML itself if it is ASCII-compatible, otherwise be UTF-8. This is explicitly stated in the algorithm spec, as well as in RFC 7578 Section 5.1.2 when referring to HTML 5.
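The selection order just described can be sketched in Python as follows. This is a simplified approximation, not the spec's algorithm verbatim: the names `pick_form_charset` and `is_ascii_compatible` are made up for illustration, the "ASCII-compatible" test is a crude heuristic, and Python's codec registry stands in for the browser's list of supported encodings.

```python
import codecs


def is_ascii_compatible(name):
    """Heuristic (an assumption, not the spec's definition):
    the encoding maps the letter 'A' to the single byte 0x41."""
    try:
        return "A".encode(name) == b"A"
    except LookupError:
        return False


def pick_form_charset(accept_charset, document_charset):
    """Return the name of the charset a form submission would use.

    accept_charset:   value of the form's accept-charset attribute, or None
    document_charset: encoding of the HTML document containing the form
    """
    if accept_charset:
        # accept-charset is a space-separated list; take the first
        # entry that is both known and ASCII-compatible.
        for label in accept_charset.split():
            try:
                codec = codecs.lookup(label)
            except LookupError:
                continue
            if is_ascii_compatible(codec.name):
                return codec.name
        return "utf-8"
    # No accept-charset: use the document's own encoding if it qualifies.
    if document_charset and is_ascii_compatible(document_charset):
        return codecs.lookup(document_charset).name
    return "utf-8"
```

For instance, `pick_form_charset("UTF-8 ISO-8859-1", "windows-1252")` picks UTF-8, while a UTF-16 document with no accept-charset attribute also falls back to UTF-8. The returned names are Python codec names, which may differ in spelling from the labels a browser would report.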
So, there really is no need for the charset to be explicitly stated by a web browser since the receiver of the form submission should know which charset(s) to expect by virtue of how the <form> was created, and thus can check for those charset(s) while parsing the submission. If the receiver wants to know the specific charset used, it needs to include a hidden _charset_ field in the <form>.
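On the receiving side, the approach above might look something like the following Python sketch. It assumes the multipart body has already been split into raw name/value byte pairs by some parser; the function name `decode_form_fields` and the `expected_charsets` parameter are hypothetical. A `_charset_` value, if one arrived, is tried first, then the charsets the form was authored to produce.

```python
def decode_form_fields(raw_fields, expected_charsets=("utf-8",)):
    """Decode non-file form fields from bytes to text.

    raw_fields:        dict of bytes name -> bytes value, as split out
                       of the multipart body by an upstream parser
    expected_charsets: charsets the server's own <form> can produce,
                       in order of preference
    Returns (charset_used, dict of str name -> str value).
    """
    candidates = []

    # A _charset_ field, when the form includes one, names the encoding.
    declared = raw_fields.get(b"_charset_")
    if declared:
        candidates.append(declared.decode("ascii", errors="replace"))
    candidates.extend(expected_charsets)

    for charset in candidates:
        try:
            decoded = {
                name.decode(charset): value.decode(charset)
                for name, value in raw_fields.items()
            }
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: fall through to the next one.
            continue
        return charset, decoded

    raise ValueError("no candidate charset decodes the submission")
```

Trial decoding is only reliable when the expected charsets are distinguishable; single-byte encodings such as ISO-8859-1 accept any byte sequence, so they belong last in `expected_charsets`.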