The most recent documentation for urllib states:
Changed in version 3.7: Moved from RFC 2396 to RFC 3986 for quoting URL strings. “~” is now included in the set of reserved characters.
Why is this the case? In RFC 3986, ~ is not a reserved character:
reserved    = gen-delims / sub-delims
gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
The next section explicitly lists it as an unreserved character:
2.3. Unreserved Characters
Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
Furthermore, later on, the RFC states that (emphasis mine):
For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations;
So it seems like 3.7 is inconsistent: it claims support for the newer RFC while simultaneously regressing the handling of ~. (In fact, in the older RFC 2396, ~ is neither reserved nor 'unwise'.)
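This was tracked and closed in https://bugs.python.org/issue16285
The most recent version of the code reflects the change: ~ is in the set of characters that are never quoted, so the word "reserved" in the documentation note appears to be a wording slip rather than a description of the actual behavior.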
Ref https://github.com/python/cpython/blob/master/Lib/urllib/parse.py
_ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                         b'abcdefghijklmnopqrstuvwxyz'
                         b'0123456789'
                         b'_.-~')
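To check this on your own interpreter, here is a minimal sketch that builds the RFC 3986 unreserved set by hand and asks quote() which of those characters it still percent-encodes. The names RFC3986_UNRESERVED and escaped_unreserved are made up for this illustration and are not part of urllib; the expected results in the comments assume Python 3.7 or later.

```python
import string
from urllib.parse import quote

# RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
# (hypothetical helper constant, not something urllib exposes)
RFC3986_UNRESERVED = string.ascii_letters + string.digits + "-._~"


def escaped_unreserved(chars=RFC3986_UNRESERVED):
    """Return the characters in `chars` that quote() percent-encodes."""
    # safe="" removes the default "/" exemption so only the built-in
    # always-safe set decides what is left alone.
    return [c for c in chars if quote(c, safe="") != c]


if __name__ == "__main__":
    # On Python 3.7+ this should print an empty list; on 3.6 and earlier
    # "~" would be reported because it was still being escaped as %7E.
    print(escaped_unreserved())
    # On Python 3.7+ the tilde is kept and only the space is encoded.
    print(quote("~user/file name"))
```

If the first line prints an empty list, the interpreter treats every RFC 3986 unreserved character, including ~, as safe, which matches the _ALWAYS_SAFE definition above.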