This question concerns the characters in the query string portion of the URL, which appear after the ? character.
Per Wikipedia, certain characters are left as is and others are encoded (usually with a % escape sequence).
I've been trying to track this down to actual specifications, so that I understand the justification behind every bullet point in that Wikipedia page.
Contradiction Example 1:
The HTML specification says to encode space as + and defers the rest to RFC1738. However, this RFC says that ~ is unsafe and furthermore that "[a]ll unsafe characters must always be encoded within the URL". This seems to contradict Wikipedia.
In practice, IE8 encodes ~ in the query strings it generates, while FF3 leaves it as is.
Contradiction Example 2:
Wikipedia states that all characters that it does not mention must be encoded, and ! is not mentioned in Wikipedia. But RFC1738 states that ! is a "special" character and "may be used unencoded". This seems to contradict Wikipedia, which says that it must be encoded.
In practice, IE8 encodes ! in the query strings it generates, while FF3 leaves it as is.
I understand that the moral of this is probably going to be to encode those characters that are in doubt between Wikipedia and the specifications. Perhaps even going as far as encoding everything that is not [A-Za-z0-9]. I would just like to know the actual standards on this.
Conclusions
The algorithm described on Wikipedia encodes precisely those characters which are not RFC3986 unreserved characters. That is, it encodes all characters other than alphanumerics and -._~. As a special case, space is encoded as + instead of %20 per RFC3986.
Some applications use an older RFC. For comparison, the RFC2396 unreserved characters are alphanumerics and !'()*-._~.
For comparison, the HTML5 working draft algorithm encodes all characters other than alphanumerics and *-._. The special case encoding for space remains +. Notable differences are that * is not encoded and ~ is encoded. (Technically, this handling of * is compatible with RFC3986 even though * is in reserved, because it is in the sub-delims, which are allowed in the query production.)
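To make the three rule sets easier to compare, here is a minimal Python sketch. The function name encode_query_value and the SAFE_* constants are made up for illustration, and the sketch assumes the text is encoded as UTF-8 before escaping, which none of the specifications quoted above mandate. It is not a transcription of any browser's actual behaviour.

```python
import string

ALNUM = string.ascii_letters + string.digits

# Characters left unencoded under each rule set discussed above.
SAFE_RFC3986 = ALNUM + "-._~"
SAFE_RFC2396 = ALNUM + "!'()*-._~"
SAFE_HTML5_DRAFT = ALNUM + "*-._"


def encode_query_value(text, safe, space_as_plus=True):
    """Escape every byte of text that is not in `safe` (hypothetical helper).

    Assumption: the text is converted to UTF-8 bytes first.
    """
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in safe:
            out.append(ch)                      # left as is
        elif ch == " " and space_as_plus:
            out.append("+")                     # form-style special case
        else:
            out.append("%{:02X}".format(byte))  # percent escape
    return "".join(out)


sample = "a b~c*d!"
print(encode_query_value(sample, SAFE_RFC3986))      # a+b~c%2Ad%21
print(encode_query_value(sample, SAFE_RFC2396))      # a+b~c*d!
print(encode_query_value(sample, SAFE_HTML5_DRAFT))  # a+b%7Ec*d%21
```

The only thing that changes between the three calls is the safe set, which is why ~, * and ! come out differently depending on which document an implementation followed.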
You can represent any member of the execution character set by an escape sequence. They are primarily used to put nonprintable characters in character and string literals. For example, you can use escape sequences to put such characters as tab, carriage return, and backspace into an output stream.
The query component is a string of information to be interpreted by the resource. Within a query component, the characters ";", "/", "?", ":", "@", "&", "=", "+", ",", and "$" are reserved.
If you must escape a character in a string literal, you must use the dollar sign ($) instead of percent (%); for example, use query=title%20EQ%20"$3CMy title$3E" instead of query=title%20EQ%20'%3CMy title%3E' .
The answer lies in the RFC 3986 document, specifically Section 3.4.
The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.
...
The characters slash ("/") and question mark ("?") may represent data within the query component.
Technically, RFC 3986-3.4 defines the query component as:
query = *( pchar / "/" / "?" )
This syntax means that query can include all characters from pchar as well as / and ?. pchar refers to another specification of path characters. Helpfully, Appendix A of RFC 3986 lists the relevant ABNF definitions, most notably:
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Thus, in addition to all alphanumerics and percent encoded characters, a query can legally include the following unencoded characters:
/ ? : @ - . _ ~ ! $ & ' ( ) * + , ; =
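As a rough check of the grammar above, the query production can be transcribed into a regular expression. This is only a sketch: the names QUERY_RE and is_valid_query are invented here, and the pattern tests syntax only, not what any server will do with the string.

```python
import re

# query = *( pchar / "/" / "?" ), transcribed from the ABNF above.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
QUERY_RE = re.compile(
    r"(?:[" + UNRESERVED + SUB_DELIMS + r":@/?]"  # one literal character
    r"|%[0-9A-Fa-f]{2})*"                         # or one pct-encoded triplet
)


def is_valid_query(query):
    """Return True if `query` (without the leading ?) matches the production."""
    return QUERY_RE.fullmatch(query) is not None


print(is_valid_query("a=1&b=~x!"))  # True
print(is_valid_query("q=a b"))      # False: a raw space is not allowed
print(is_valid_query("q=100%"))     # False: % must be followed by two hex digits
```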
Of course, you may want to keep in mind that '=' and '&' usually have special significance within a query.
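A short sketch of why that matters, using Python's urllib.parse (the sample value is made up). A value that contains & or = is syntactically legal when left unencoded, but a key/value parser will split it in the wrong place:

```python
from urllib.parse import parse_qsl, quote

value = "salt&pepper=yes"  # hypothetical value containing both delimiters

# Left unencoded, the value is read as two separate parameters.
print(parse_qsl("q=" + value))
# [('q', 'salt'), ('pepper', 'yes')]

# Percent-encoding & and = inside the value keeps it in one piece.
encoded = "q=" + quote(value, safe="")
print(encoded)             # q=salt%26pepper%3Dyes
print(parse_qsl(encoded))  # [('q', 'salt&pepper=yes')]
```

So even though the RFC 3986 grammar permits these characters, it is usually necessary to encode them whenever they appear inside a key or a value rather than as separators.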