
How to navigate to URLs with \u in them?

I have come across URLs that contain \u Unicode escapes, such as the following (note that this does not map to a valid page; it is just an example).

http://my_site_name.com/\u0442\uab86\u0454\uab8eR-\u0454\u043d-\u043c/23795908

How can I decode/encode such a URL using Python so that I can successfully do an HTTP GET to retrieve the data from this web page?

aBlaze asked Apr 03 '18 00:04


2 Answers

Technically, these are not valid URLs, but they are valid IRIs (Internationalized Resource Identifiers), as defined in RFC 3987.

The way you encode an IRI to a URI is:

  • UTF-8 encode the path
  • %-encode the resulting UTF-8

For example (taken from the linked Wikipedia article), this IRI:

https://en.wiktionary.org/wiki/Ῥόδος

… maps to this URI:

https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82
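The two steps above can be sketched with the standard library. This is only a minimal illustration, assuming Python 3; the variable names are made up for the example, and the path segment is written with \u escapes just to keep the snippet ASCII.

```python
from urllib.parse import quote_from_bytes

# Path segment from the Wiktionary example, written with \u escapes.
segment = "\u1fec\u03cc\u03b4\u03bf\u03c2"

# Step 1: UTF-8 encode the text.
utf8_bytes = segment.encode("utf-8")

# Step 2: %-encode the resulting bytes.
encoded = quote_from_bytes(utf8_bytes)

print(encoded)  # %E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82
```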

I believe requests handles these out of the box (although only fairly recently, and only "partial support" is there until 3.0, and I'm not sure what that means). I'm pretty sure urllib2 in Python 2.7 doesn't, and urllib.request in Python 3.6 probably doesn't either.

At any rate, if your chosen HTTP library doesn't handle IRIs, you can do it manually:

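import urllib.parse

def iri_to_uri(iri):
    p = urllib.parse.urlparse(iri)
    # UTF-8 encode the path, then %-encode the resulting bytes
    path = urllib.parse.quote_from_bytes(p.path.encode('utf-8'))
    # Rebuild the parsed tuple with the encoded path in slot 2
    p2 = p[:2] + (path,) + p[3:]
    return urllib.parse.urlunparse(p2)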

There are also a number of third-party libraries to handle IRIs, mostly spun off from other projects like Twisted and Amara. It may be worth searching PyPI for one rather than building it yourself.

Or you may want a higher-level library like hyperlink that handles all of the complicated issues in RFC 3987 (and RFC 3986, the current version of the spec for URIs—which neither requests 2.x nor the Python 3.6 stdlib handle quite right).


If you have to deal with IRIs manually, there's a good chance you also have to deal with IDNs (Internationalized Domain Names) in place of ASCII domain names, even though technically they're unrelated specs. So you probably want to do something like this:

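import urllib.parse

def iri_to_uri(iri):
    p = urllib.parse.urlparse(iri)
    # IDNA-encode the host name, %-encode the UTF-8 path
    netloc = p.netloc.encode('idna').decode('ascii')
    path = urllib.parse.quote_from_bytes(p.path.encode('utf-8'))
    # Rebuild the parsed tuple with netloc in slot 1 and path in slot 2
    p2 = p[:1] + (netloc, path) + p[3:]
    return urllib.parse.urlunparse(p2)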
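
As a rough check of the idea, the sketch below applies the same conversion to the URL from the question and to a made-up IDN host. `to_ascii_url` and `bücher.example` are hypothetical names chosen for this illustration, and the helper has not been checked against netlocs that carry userinfo or a port.

```python
import urllib.parse

def to_ascii_url(iri):
    """Hypothetical helper: IDNA-encode the host, %-encode the UTF-8 path."""
    p = urllib.parse.urlparse(iri)
    netloc = p.netloc.encode("idna").decode("ascii")
    path = urllib.parse.quote_from_bytes(p.path.encode("utf-8"))
    return urllib.parse.urlunparse(p._replace(netloc=netloc, path=path))

# The URL from the question; Python turns the \u escapes into characters.
question_url = (
    "http://my_site_name.com/"
    "\u0442\uab86\u0454\uab8eR-\u0454\u043d-\u043c/23795908"
)
ascii_url = to_ascii_url(question_url)
print(ascii_url)  # pure ASCII, safe to pass to an HTTP client

# A made-up IRI with a non-ASCII host and path.
idn_url = to_ascii_url("http://b\u00fccher.example/\u1fec\u03cc\u03b4\u03bf\u03c2")
print(idn_url)
# http://xn--bcher-kva.example/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82
```

Decoding `ascii_url` with `urllib.parse.unquote` should give back the original string, which is a quick way to confirm nothing was lost.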

abarnert answered Nov 11 '22 04:11


Here is an approach that automates the detection and %-encoding of non-ASCII characters in both the path and domain sections of IRIs:

from urllib.parse import quote

def iri_to_uri(iri):
    # Leave ASCII characters alone; %-encode the UTF-8 bytes of everything else
    return "".join([x if ord(x) < 128 else quote(x) for x in iri])
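
One thing worth checking before relying on this: a non-ASCII host gets %-encoded too, which is not the same as its IDNA (Punycode) form. A small sketch, using a made-up host `bücher.example` and a hypothetical function name:

```python
from urllib.parse import quote

def percent_encode_non_ascii(iri):
    """Hypothetical name: %-encode every non-ASCII character, leave ASCII alone."""
    return "".join(ch if ord(ch) < 128 else quote(ch) for ch in iri)

# Made-up IRI with a non-ASCII host and path.
result = percent_encode_non_ascii("http://b\u00fccher.example/\u1fec")
print(result)  # http://b%C3%BCcher.example/%E1%BF%AC

# For comparison, the IDNA form of the same host.
idna_host = "b\u00fccher.example".encode("idna").decode("ascii")
print(idna_host)  # xn--bcher-kva.example
```

Whether the %-encoded host resolves depends on the HTTP client, so for IRIs with non-ASCII domains the IDNA route may be the safer choice.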

TavoloPerUno answered Nov 11 '22 03:11