multipart/form-data, what is the default charset for fields?

Tags:

standards-compliance

http

multipartform-data

rfc

What is the default encoding one should use to decode multipart/form-data if no charset is given? RFC 2388 states:

4.5 Charset of text in form data

Each part of a multipart/form-data is supposed to have a content-type. In the case where a field element is text, the charset parameter for the text indicates the character encoding used.

For example, a form with a text field in which a user typed 'Joe owes <eu>100' where <eu> is the Euro symbol might have form data returned as:
--AaB03x
content-disposition: form-data; name="field1"
content-type: text/plain;charset=windows-1250
content-transfer-encoding: quoted-printable

Joe owes =80100.
--AaB03x
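
For reference, when the charset is given as in the RFC example, the part can be decoded in two steps. A minimal sketch in Python using only the standard library, with the body bytes copied from the example above (the variable names are just for illustration):

```python
import quopri

# Raw body of the part from the RFC 2388 example (the line after the blank line).
raw_body = b"Joe owes =80100."

# Step 1: undo the quoted-printable transfer encoding; "=80" becomes the byte 0x80.
decoded_bytes = quopri.decodestring(raw_body)

# Step 2: interpret the bytes using the charset named in the part's content-type.
text = decoded_bytes.decode("windows-1250")
print(text)
```

Byte 0x80 maps to the Euro sign in windows-1250, which is why the charset label matters: the same byte means something else (or nothing) in other encodings.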

In my case, the charset isn't set and I don't know how to decode the data within that text/plain section. As I do not want to enforce something that isn't standard behavior I'm asking what the expected behavior in this case is. The RFC does not seem to explain this so I'm kinda lost.

Thank you!


asked Nov 03 '10 09:11

Malax

2 Answers

This apparently has changed in HTML5 (see http://dev.w3.org/html5/spec-preview/constraints.html#multipart-form-data).

The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified.

So where is the character set specified? As far as I can tell from the encoding algorithm, the only place is within a form data set entry named _charset_.

If your form does not have a hidden input named _charset_, what happens? I've tested this in Chrome 28, sending a form encoded in UTF-8 and one in ISO-8859-1 and inspecting the sent headers and payload, and I don't see charset given anywhere (even though the text encoding definitely changes). If I include an empty _charset_ field in the form, Chrome populates that with the correct charset type. I guess any server-side code must look for that _charset_ field to figure it out?
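
If the server does rely on _charset_, the lookup could look roughly like this. This is only a sketch in Python: the function name decode_fields, the dict-of-bytes input (field name to raw value, as some hypothetical multipart parser might hand them over), and the UTF-8 fallback are all assumptions for illustration, not something the spec or any framework prescribes.

```python
import codecs
from typing import Dict


def decode_fields(raw_fields: Dict[str, bytes], fallback: str = "utf-8") -> Dict[str, str]:
    """Decode raw field values using the `_charset_` entry when it is usable."""
    charset = fallback  # assumption: fall back to UTF-8 when nothing is declared
    declared = raw_fields.get("_charset_")
    if declared:
        try:
            # The browser fills this field with an ASCII label such as b"ISO-8859-1".
            label = declared.decode("ascii").strip()
            codecs.lookup(label)  # raises LookupError for unknown labels
            charset = label
        except (ValueError, LookupError):
            pass  # unusable label: keep the fallback
    # errors="replace" avoids raising on bytes that do not fit the chosen charset.
    return {name: value.decode(charset, errors="replace")
            for name, value in raw_fields.items()}


# A form submitted from an ISO-8859-1 page with an empty hidden _charset_ input.
latin1_form = {"_charset_": b"ISO-8859-1", "name": "caf\u00e9".encode("iso-8859-1")}
# A form without the hidden input, assumed here to be UTF-8.
plain_form = {"name": "caf\u00e9".encode("utf-8")}

print(decode_fields(latin1_form)["name"])
print(decode_fields(plain_form)["name"])
```

Without the _charset_ field the server is still guessing, so the fallback is the part that really needs to match whatever encoding your pages are served in.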

I ran into this problem while writing a Chrome extension that uses XMLHttpRequest.send of a FormData object, which always gets encoded in UTF-8 no matter what the source document encoding is.

Let the request entity body be the result of running the multipart/form-data encoding algorithm with data as form data set and with utf-8 as the explicit character encoding.

Let mime type be the concatenation of "multipart/form-data;", a U+0020 SPACE character, "boundary=", and the multipart/form-data boundary string generated by the multipart/form-data encoding algorithm.

As I found earlier, charset=utf-8 is not specified anywhere in the POST request, unless you include an empty _charset_ field in the form, which in this case will automatically get populated with "utf-8".

This is my understanding of the state of things. I welcome any corrections to my assumptions!

answered Nov 16 '22 03:11

owlman

The default charset for HTTP/1.1 is ISO-8859-1 (Latin-1), per RFC 2616. I would guess that this also applies here.

3.7.1 Canonicalization and Text Defaults

--snip--

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.
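
Applied to a single part, that guess would mean: use the charset parameter from the part's Content-Type if there is one, otherwise ISO-8859-1. A rough Python sketch of that rule follows; the helper name part_charset is made up for illustration, and treating ISO-8859-1 as the default for multipart/form-data parts is the assumption from this answer, not something RFC 2388 states.

```python
from email.message import Message
from typing import Optional


def part_charset(content_type: Optional[str], default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, or the default."""
    if not content_type:
        return default  # the part has no Content-Type header at all
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or the given fallback when the parameter is missing.
    return msg.get_content_charset(default)


labelled = part_charset("text/plain;charset=windows-1250")
unlabelled = part_charset("text/plain")
missing = part_charset(None)

# Decoding an unlabelled body under the assumed default.
text = b"caf\xe9".decode(unlabelled)
print(labelled, unlabelled, missing, text)
```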

answered Nov 16 '22 02:11

Gareth Davidson

