
How is character encoding specified in a multipart/form-data HTTP POST request?

The HTML 5 specification describes an algorithm for selecting the character encoding to be used in a multi-part form submission (e.g. UTF-8). However, it is not clear how the selected encoding should be relayed to the server so that the content can be properly decoded on the receiving end.

Often, character encodings are represented by appending a "charset" parameter to the value of the Content-Type request header. However, this parameter does not appear to be defined for the multipart/form-data MIME type:

https://www.rfc-editor.org/rfc/rfc7578#section-8

Each part in a multipart form submission may provide its own Content-Type header; however, RFC 7578 notes that "in practice, many widely deployed implementations do not supply a charset parameter in each part, but rather, they rely on the notion of a 'default charset' for a multipart/form-data instance".

RFC 7578 goes on to suggest that a hidden "_charset_" form field can be used for this purpose. However, neither Safari (9.1) nor Chrome (51) appears to populate this field, and neither provides any per-part encoding information.

I've looked at the request headers produced by both browsers and I don't see any obvious character encoding information. Does anyone know how the browsers are conveying this information to the server?

Greg Brown asked Jun 23 '16 18:06




1 Answer

HTML 5 uses RFC 2388 (obsoleted by RFC 7578). However, HTML 5 explicitly removes the Content-Type header from non-file fields, while the RFCs do not:

"The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified. Their names and values must be encoded using the character encoding selected above (field names in particular do not get converted to a 7-bit safe encoding as suggested in RFC 2388)."

The RFCs are designed to allow multipart/form-data to be usable in contexts other than HTML (though HTML is its most common use). In those other contexts, a per-part Content-Type header is allowed. It is not allowed for non-file fields in HTML 5, although it is allowed in HTML 4.

Without a Content-Type header, the hidden _charset_ form field, if present, is the only way an HTML 5 <form> submitter can explicitly state which charset is used.

Per the HTML 5 algorithm spec that you linked to, the chosen charset MUST be selected from the <form> element's accept-charset attribute if present; otherwise it is the charset used by the HTML document itself if that charset is ASCII-compatible; otherwise it is UTF-8. This is stated explicitly in the algorithm spec, and also in RFC 7578 Section 5.1.2 where it refers to HTML 5.
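
The selection order described above can be sketched roughly as follows. This is only an illustration of the three-step fallback as summarized in this answer, not the spec's exact algorithm. The names pick_form_charset and is_ascii_compatible are hypothetical, the ASCII-compatibility check is a crude heuristic, and returning UTF-8 when accept-charset contains no usable label is an assumption.

```python
import codecs


def is_ascii_compatible(label: str) -> bool:
    """Heuristic (assumption): ASCII text must encode to the same bytes as US-ASCII."""
    sample = "AZaz09 &=?"
    try:
        return sample.encode(label) == sample.encode("ascii")
    except (LookupError, UnicodeError):
        # Unknown label, or an encoding that cannot represent the sample.
        return False


def pick_form_charset(accept_charset, document_charset):
    """Return the charset a browser would be expected to use for the submission.

    accept_charset:   value of the <form accept-charset="..."> attribute, or None
    document_charset: charset of the HTML page that contains the form
    """
    if accept_charset:
        # The attribute is a whitespace-separated list of labels.
        for label in accept_charset.split():
            if is_ascii_compatible(label):
                return codecs.lookup(label).name
        return "utf-8"  # assumption: no usable label falls back to UTF-8
    if document_charset and is_ascii_compatible(document_charset):
        return codecs.lookup(document_charset).name
    return "utf-8"


print(pick_form_charset("utf-16 utf-8", "iso-8859-1"))  # utf-8
print(pick_form_charset(None, "utf-16"))                # utf-8
```

Note that the returned value is Python's own codec name, which may differ in spelling from the label used in the HTML.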

So, there really is no need for the charset to be explicitly stated by a web browser since the receiver of the form submission should know which charset(s) to expect by virtue of how the <form> was created, and thus can check for those charset(s) while parsing the submission. If the receiver wants to know the specific charset used, it needs to include a hidden _charset_ field in the <form>.
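
On the receiving side, honoring a hidden _charset_ field might look something like the sketch below. It assumes the form contains <input type="hidden" name="_charset_"> and that the browser fills in the charset it used. The function name decode_form_fields, the sample boundary, and the UTF-8 fallback are all hypothetical. It uses Python's standard email parser to split the parts, which is one convenient option and not the only one.

```python
import codecs
from email.parser import BytesParser


def decode_form_fields(content_type, body, fallback_charset="utf-8"):
    """Decode the non-file fields of a multipart/form-data body.

    content_type: value of the request's Content-Type header (with boundary)
    body:         raw request body as bytes
    Returns (charset_used, {field_name: decoded_value}).
    """
    # The email parser needs the Content-Type header to find the boundary.
    header = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    message = BytesParser().parsebytes(header + body)
    if not message.is_multipart():
        raise ValueError("not a multipart body")

    raw_fields = {}
    for part in message.get_payload():
        if part.get_filename() is not None:
            continue  # file upload: leave its bytes alone in this sketch
        name = part.get_param("name", header="content-disposition")
        raw_fields[name] = part.get_payload(decode=True)  # undecoded bytes

    # The _charset_ value itself is expected to be plain ASCII.
    declared = raw_fields.pop("_charset_", b"").decode("ascii", "replace").strip()
    try:
        codecs.lookup(declared)
        charset = declared
    except LookupError:
        charset = fallback_charset  # assumption: unknown or missing label

    fields = {n: v.decode(charset, "replace") for n, v in raw_fields.items()}
    return charset, fields


# Hypothetical request body, as a browser might send it for a windows-1252 page.
sample_body = (
    b"--XyZ\r\n"
    b'Content-Disposition: form-data; name="_charset_"\r\n\r\n'
    b"windows-1252\r\n"
    b"--XyZ\r\n"
    b'Content-Disposition: form-data; name="city"\r\n\r\n'
    + "Zürich".encode("windows-1252") + b"\r\n"
    b"--XyZ--\r\n"
)
charset, fields = decode_form_fields("multipart/form-data; boundary=XyZ", sample_body)
print(charset, fields)
```

Field names are kept ASCII here on purpose; raw non-ASCII bytes in a Content-Disposition header would need separate handling.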

Remy Lebeau answered Sep 23 '22 06:09
