I have come across URLs which have \u Unicode characters within them, such as the following (note that this will not map to a valid page - it is just an example).
http://my_site_name.com/\u0442\uab86\u0454\uab8eR-\u0454\u043d-\u043c/23795908
How can I decode/encode such a URL using Python so that I can successfully do a HTTP GET to retrieve the data from this web page?
Technically, these are not valid URLs, but they are valid IRIs (Internationalized Resource Identifiers), as defined in RFC 3987.
The way you encode an IRI to a URI is:
For example (taken from the linked Wikipedia article), this IRI:
https://en.wiktionary.org/wiki/Ῥόδος
… maps to this URI:
https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82
I believe requests
handles these out of the box (although only pretty recently, and only "partial support" is there until 3.0, and I'm not sure what that means). I'm pretty sure urllib2
in Python2.7 doesn't, and urllib.request
in Python 3.6 probably doesn't either.
At any rate, if your chosen HTTP library doesn't handle IRIs, you can do it manually:
def iri_to_uri(iri):
p = urllib.parse.urlparse(iri)
path = urllib.parse.quote_from_bytes(p.path.encode('utf-8'))
p = [:2] + (path,) + p[3:]
return urllib.parse.urlunparse(p2)
There are also a number of third-party libraries to handle IRIs, mostly spun off from other projects like Twisted and Amara. It may be worth searching PyPI for one rather than building it yourself.
Or you may want a higher-level library like hyperlink
that handles all of the complicated issues in RFC 3987 (and RFC 3986, the current version of the spec for URIs—which neither requests
2.x nor the Python 3.6 stdlib handle quite right).
If you have to deal with IRIs manually, there's a good chance you also have to deal with IDNs Internationalized Domain Names in place of ASCII domain names too, even though technically they're unrelated specs. So you probably want to do something like this:
def iri_to_uri(iri):
p = urllib.parse.urlparse(iri)
netloc = p.netloc.encode('idna').decode('ascii')
path = urllib.parse.quote_from_bytes(p.path.encode('utf-8'))
p = [:1] + (netloc, path) + p[3:]
return urllib.parse.urlunparse(p2)
Here is an approach that automates the detection and %-encoding of non-ASCII both in the path and domain sections of IRIs:
from urllib.request import quote
def iri_to_uri(iri):
return ("".join([x if ord(x) < 128 else quote(x) for x in iri]))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With