Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

multipart/form-data, what is the default charset for fields?

what is the default encoding one should use to decode multipart/form-data if no charset is given? RFC2388 states:

4.5 Charset of text in form data

Each part of a multipart/form-data is supposed to have a content- type. In the case where a field element is text, the charset parameter for the text indicates the character encoding used.

For example, a form with a text field in which a user typed 'Joe owes <eu>100' where <eu> is the Euro symbol might have form data returned as:

--AaB03x
content-disposition: form-data; name="field1"
content-type: text/plain;charset=windows-1250
content-transfer-encoding: quoted-printable>>

Joe owes =80100.
--AaB03x

In my case, the charset isn't set and I don't know how to decode the data within that text/plain section. As I do not want to enforce something that isn't standard behavior I'm asking what the expected behavior in this case is. The RFC does not seem to explain this so I'm kinda lost.

Thank you!

like image 972
Malax Avatar asked Nov 03 '10 09:11

Malax


People also ask

What is the format of multipart form data?

A multipart/form-data request body contains a series of parts separated by a boundary delimiter, constructed using Carriage Return Line Feed (CRLF), "--", and also the value of the boundary parameters. The boundary delimiter must not appear inside any of the encapsulated parts.

How is multipart form data encoded?

Multipart/form-data is one of the most used enctype/content type. In multipart, each of the field to be sent has its content type, file name and data separated by boundary from other field. No encoding of the data is necessary, because of the unique boundary. The binary data is sent as it is.

What format is form data?

A Forms Data Format (FDF) file is a formatted ASCII file which contains key:value pairs defining the field names and associated values that are used to populate a form. Forms Data Format (FDF) is defined in the international stand for Portable Document Format; ISO-32000.

What is multipart form data in REST API?

Multipart requests combine one or more sets of data into a single body, separated by boundaries. You typically use these requests for file uploads and for transferring data of several types in a single request (for example, a file along with a JSON object).


2 Answers

This apparently has changed in HTML5 (see http://dev.w3.org/html5/spec-preview/constraints.html#multipart-form-data).

The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified.

So where is the character set specified? As far as I can tell from the encoding algorithm, the only place is within a form data set entry named _charset_.

If your form does not have a hidden input named _charset_, what happens? I've tested this in Chrome 28, sending a form encoded in UTF-8 and one in ISO-8859-1 and inspecting the sent headers and payload, and I don't see charset given anywhere (even though the text encoding definitely changes). If I include an empty _charset_ field in the form, Chrome populates that with the correct charset type. I guess any server-side code must look for that _charset_ field to figure it out?

I ran into this problem while writing a Chrome extension that uses XMLHttpRequest.send of a FormData object, which always gets encoded in UTF-8 no matter what the source document encoding is.

Let the request entity body be the result of running the multipart/form-data encoding algorithm with data as form data set and with utf-8 as the explicit character encoding.

Let mime type be the concatenation of "multipart/form-data;", a U+0020 SPACE character, "boundary=", and the multipart/form-data boundary string generated by the multipart/form-data encoding algorithm.

As I found earlier, charset=utf-8 is not specified anywhere in the POST request, unless you include an empty _charset_ field in the form, which in this case will automatically get populated with "utf-8".

This is my understanding of the state of things. I welcome any corrections to my assumptions!

like image 85
owlman Avatar answered Nov 16 '22 03:11

owlman


The default charset for HTTP 1.1 is ISO-8859-1 (Latin1), I would guess that this also applies here.

3.7.1 Canonicalization and Text Defaults

--snip--

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

like image 41
Gareth Davidson Avatar answered Nov 16 '22 02:11

Gareth Davidson