
Python: Why is “~” now included in the set of reserved characters in urllib.parse.quote()?

The most recent documentation for urllib states:

Changed in version 3.7: Moved from RFC 2396 to RFC 3986 for quoting URL strings. “~” is now included in the set of reserved characters.

Why is this the case? In RFC 3986, ~ is not a reserved character:

 reserved    = gen-delims / sub-delims

 gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

 sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
             / "*" / "+" / "," / ";" / "="

The next section explicitly lists it as an unreserved character:

2.3. Unreserved Characters

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

 unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

Furthermore, the RFC later states (emphasis mine):

For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations;

So 3.7 seems inconsistent: it claims support for the newer RFC while simultaneously regressing the handling of ~. (In fact, in the older RFC 2396, ~ is neither reserved nor 'unwise'.)

cowbert asked Jul 14 '18 00:07


1 Answer

This was tracked as a bug and closed in https://bugs.python.org/issue16285

The most recent version of the code reflects the change: ~ is part of the set of characters that quote() never escapes.

Ref https://github.com/python/cpython/blob/master/Lib/urllib/parse.py

_ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                         b'abcdefghijklmnopqrstuvwxyz'
                         b'0123456789'
                         b'_.-~')
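
A quick way to check the behaviour yourself is sketched below. It assumes Python 3.7 or later; the sample path `/~user/my file.txt` and the name `UNRESERVED_PUNCTUATION` are made up for illustration and do not come from the standard library.

```python
from urllib.parse import quote, unquote

# The RFC 3986 unreserved characters other than letters and digits.
# (Illustrative name, not part of urllib.)
UNRESERVED_PUNCTUATION = "-._~"

# Even with no extra "safe" characters, quote() leaves these untouched
# on Python 3.7+, because they are in _ALWAYS_SAFE.
print(quote(UNRESERVED_PUNCTUATION, safe=""))  # -._~

# With the default safe="/", the space is percent-encoded but "~" is not.
print(quote("/~user/my file.txt"))  # /~user/my%20file.txt

# Decoding still accepts the legacy "%7E" form produced by older software.
print(unquote("/%7Euser/"))  # /~user/
```

So the code treats ~ as unreserved, in line with RFC 3986. Only the wording of the "Changed in version 3.7" note suggests otherwise.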
nosh answered Nov 15 '22 14:11