
Why do the encodings of a URL and its query string differ?

Tags:

http

I was researching why my query parameters contain plus signs (+) instead of %20, and why they contain sequences like %C3%BC instead of a literal ü (UTF-8), as an encoded URL does.

After two hours of thinking my web app was not compatible with the URL encoding standard, I found that the encoding scheme of a query string is not the same as the encoding of a URL (by which I mean the part without the query string).

Examples:

  • URL:
    • whitespace encodes to %20
    • UTF-8 chars stay UTF-8 chars
  • Query params:
    • whitespace encodes to +
    • UTF-8 chars encode to their hex representation

So can someone tell me why the encoding schemes differ, given that the query parameters are part of the URL?

See:

  • Wikipedia: Percent-encoding
  • Wikipedia: Query string

asked Mar 20 '11 00:03 by moritz


2 Answers

URIs originated in RFC 1630, with percent-encoding as a method to allow "unsafe" characters to be represented. This original version actually mentioned the ISO Latin 1 character set as the encoding for non-ASCII characters. RFC 1738 later that year removed this reference to Latin-1 in defining URLs.

The query string format is actually a different but related encoding, application/x-www-form-urlencoded, defined in RFC 1866 along with HTML 2.0. It was based on RFC 1738, but specified that spaces (not all whitespace, just the character with ASCII code 0x20) are replaced by '+' and that line breaks are to be encoded as CRLF (i.e. %0D%0A). The former is likely because that saves 2 bytes for a very common character in form submissions at the expense of using an extra 2 bytes for a much less common character, and the latter is to avoid problems when transferring between systems using different end-of-line codings. Non-ASCII characters were left unconsidered.
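As a rough illustration of the difference described above, here is a minimal Python sketch using the standard library's urllib.parse. It assumes that quote() is a fair stand-in for generic percent-encoding and quote_plus()/urlencode() for application/x-www-form-urlencoded; the sample strings are made up for the example.

```python
from urllib.parse import quote, quote_plus, urlencode

text = "hello world"  # made-up sample value

# Generic percent-encoding: a space becomes %20.
path_encoded = quote(text)

# Form encoding: a space becomes '+', saving two bytes.
form_encoded = quote_plus(text)

# The price: a literal '+' now has to be escaped as %2B.
plus_encoded = quote_plus("1+1")

# A line break already normalised to CRLF is sent as %0D%0A.
# (Python does not normalise line endings here; the input already uses CRLF.)
crlf_encoded = quote_plus("line1\r\nline2")

# urlencode() builds a whole query string using the form rules.
query = urlencode({"q": text, "expr": "1+1"})

print(path_encoded)   # hello%20world
print(form_encoded)   # hello+world
print(plus_encoded)   # 1%2B1
print(crlf_encoded)   # line1%0D%0Aline2
print(query)          # q=hello+world&expr=1%2B1
```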

UTF-8 coding in URIs came over a decade later, in RFC 3986, although individual protocols may have specified this or another encoding of non-ASCII characters earlier. To maintain backwards compatibility, all UTF-8 octets must be percent-encoded. The companion RFC 3987 defines "Internationalized Resource Identifiers" (IRIs) which are basically "URIs with most codepoints 160 and above allowed to appear unencoded", but many protocols still require URIs. Note that your statement above is incorrect, as a URL may not contain an unencoded ü or any other non-ASCII character.
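To make the point about percent-encoded UTF-8 octets concrete, here is a small Python sketch. The path "/münchen" is a made-up example, and treating quote() as the IRI-to-URI conversion for a path is a simplification (real IRI handling also covers the host name and normalization).

```python
from urllib.parse import quote, unquote

# 'ü' is two octets in UTF-8 (0xC3 0xBC); each octet is percent-encoded.
utf8_escaped = quote("ü")

# The same character under Latin-1 is a single octet (0xFC).
latin1_escaped = quote("ü", encoding="latin-1")

# Simplified IRI -> URI conversion for a path: non-ASCII characters are
# escaped, while '/' is left alone (it is in quote()'s default safe set).
iri_path = "/münchen"
uri_path = quote(iri_path)

# Decoding only round-trips if the receiver assumes the same charset.
round_trip = unquote(utf8_escaped)       # UTF-8 is the default
wrong_charset = unquote(latin1_escaped)  # 0xFC is not valid UTF-8

print(utf8_escaped)    # %C3%BC
print(latin1_escaped)  # %FC
print(uri_path)        # /m%C3%BCnchen
print(round_trip)      # ü
print(wrong_charset == "\ufffd")  # True: replaced by U+FFFD
```

The last line shows why the charset matters: the percent-escapes carry only octets, not the information about which encoding produced them.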

application/x-www-form-urlencoded has been internationalized in a different manner. The HTML5 specification of application/x-www-form-urlencoded explicitly allows that any ASCII-compatible character set may be used for characters in the query string, and in fact different fields may use different character sets, but all non-ASCII octets must still be percent-encoded. When used in the query part of an IRI, it is possible that these characters could be represented unencoded if properly-normalized UTF-8 is being used as the character set, since conversion back to a URI would result in correct application/x-www-form-urlencoded data.
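A short Python sketch of the "any ASCII-compatible character set" point, assuming urlencode() and parse_qs() behave like a form submitter and a receiving server respectively. The field name and value are invented for the example.

```python
from urllib.parse import urlencode, parse_qs

field = {"name": "Müller"}  # made-up form field

# The same form field serialised under two different character sets.
as_utf8 = urlencode(field, encoding="utf-8")
as_latin1 = urlencode(field, encoding="latin-1")

# Either way, the result on the wire is pure ASCII.
both_ascii = as_utf8.isascii() and as_latin1.isascii()

# The receiver must know (or guess) which charset was used to decode it.
decoded_utf8 = parse_qs(as_utf8, encoding="utf-8")
decoded_latin1 = parse_qs(as_latin1, encoding="latin-1")

print(as_utf8)         # name=M%C3%BCller
print(as_latin1)       # name=M%FCller
print(both_ascii)      # True
print(decoded_utf8)    # {'name': ['Müller']}
print(decoded_latin1)  # {'name': ['Müller']}
```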

answered Nov 09 '22 11:11 by Anomie


They don't necessarily have to differ: a + is a valid path character and a ü is a valid search character (per RFC 3987). You're probably seeing browsers, or some other software with a preconceived encoding scheme, making assumptions that are either outdated or overly cautious.
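One way to see that + is just an ordinary character in the path, and only means "space" under form decoding, is this small Python sketch. The URL is made up, and using unquote() for the path and parse_qs() for the query is an assumption about how a typical server would decode each part.

```python
from urllib.parse import urlsplit, unquote, parse_qs

# Made-up URL with the same escaped text in the path and in the query.
url = "http://example.com/a+b%20c?q=a+b%20c"
parts = urlsplit(url)

# Path: plain percent-decoding, so '+' stays a literal plus sign.
path = unquote(parts.path)

# Query: form decoding, so '+' and %20 both become a space.
params = parse_qs(parts.query)

print(path)    # /a+b c
print(params)  # {'q': ['a b c']}
```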

answered Nov 09 '22 10:11 by eyelidlessness