I'm wondering what's the best way -- or if there's a simple way with the standard library -- to convert a URL with Unicode chars in the domain name and path to the equivalent ASCII URL, encoded with domain as IDNA and the path %-encoded, as per RFC 3986.
I get from the user a URL in UTF-8. So if they've typed in http://➡.ws/♥ I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. And what I want out is the ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.
What I do at the moment is split the URL up into parts via a regex, and then manually IDNA-encode the domain, and separately encode the path and query string with different urllib.quote() calls.
    # url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
    match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                     r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
    if not match:
        raise BadURLException(url)
    protocol, domain, port, path, query = match.groups()

    try:
        domain = unicode(domain, 'utf-8')
    except UnicodeDecodeError:
        return ''  # bad UTF-8 chars in domain
    domain = domain.encode('idna')

    if port is None:
        port = ''

    path = urllib.quote(path)

    if query is None:
        query = ''
    else:
        query = urllib.quote(query, safe='=&?/')

    url = protocol + '://' + domain + port + path + query
    # url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'
Is this correct? Any better suggestions? Is there a simple standard-library function to do this?
In summary, to convert Unicode characters into ASCII characters, use the normalize() function from the unicodedata module and the built-in encode() function for strings. You can either ignore or replace Unicode characters that do not have ASCII counterparts.
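As a rough illustration of the normalize()/encode() approach just described, here is a minimal Python 3 sketch. The helper name to_ascii is made up for this example. Note that this conversion is lossy, so it is a different operation from the reversible IDNA and percent-encoding that the question asks about.

```python
import unicodedata


def to_ascii(text, errors="ignore"):
    """Decompose accented characters, then drop or replace anything non-ASCII.

    to_ascii is a hypothetical helper name used only for this sketch.
    errors is passed to str.encode: "ignore" drops, "replace" substitutes "?".
    """
    # NFKD splits e.g. "é" into "e" plus a combining accent
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors).decode("ascii")


print(to_ascii("Åsa café"))                      # Asa cafe
print(to_ascii("➡ heart ♥", errors="replace"))   # ? heart ?
```

Characters such as ➡ and ♥ have no ASCII decomposition, so they are either dropped or replaced by "?", which is why this technique is not suitable for producing an equivalent URL.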
You can encode multiple parameters at once using the urllib.parse.urlencode() function. This is a convenience function that takes a dictionary of key-value pairs or a sequence of two-element tuples and, by default, uses quote_plus() to encode every key and value.
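A short Python 3 sketch of urlencode() with both input shapes. The parameter names and values are invented for illustration.

```python
from urllib.parse import urlencode

# A dict of key/value pairs (example data, not from the question)
params = {"q": "♥ love", "lang": "sv"}
encoded_dict = urlencode(params)
print(encoded_dict)   # q=%E2%99%A5+love&lang=sv

# A sequence of two-element tuples allows repeated keys
pairs = [("tag", "a&b"), ("tag", "c/d")]
encoded_pairs = urlencode(pairs)
print(encoded_pairs)  # tag=a%26b&tag=c%2Fd
```

Because quote_plus() is used, spaces become "+" and reserved characters such as "&" and "/" inside values are percent-encoded. This only covers the query string; the host name still needs IDNA encoding separately.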
    import urlparse, urllib

    def fixurl(url):
        # turn string into unicode
        if not isinstance(url,unicode):
            url = url.decode('utf8')

        # parse it
        parsed = urlparse.urlsplit(url)

        # divide the netloc further
        userpass,at,hostport = parsed.netloc.rpartition('@')
        user,colon1,pass_ = userpass.partition(':')
        host,colon2,port = hostport.partition(':')

        # encode each component
        scheme = parsed.scheme.encode('utf8')
        user = urllib.quote(user.encode('utf8'))
        colon1 = colon1.encode('utf8')
        pass_ = urllib.quote(pass_.encode('utf8'))
        at = at.encode('utf8')
        host = host.encode('idna')
        colon2 = colon2.encode('utf8')
        port = port.encode('utf8')
        path = '/'.join(  # could be encoded slashes!
            urllib.quote(urllib.unquote(pce).encode('utf8'),'')
            for pce in parsed.path.split('/')
        )
        query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'),'=&?/')
        fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))

        # put it back together
        netloc = ''.join((user,colon1,pass_,at,host,colon2,port))
        return urlparse.urlunsplit((scheme,netloc,path,query,fragment))

    print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
    print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
    print fixurl(u'http://Åsa:abc123@➡.ws:81/admin')
    print fixurl(u'http://➡.ws/admin')
http://xn--hgi.ws/%E2%99%A5
http://xn--hgi.ws/%E2%99%A5/%2F
http://%C3%85sa:abc123@xn--hgi.ws:81/admin
http://xn--hgi.ws/admin
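The fixurl code above is Python 2 (urlparse, urllib.quote, the unicode type). For readers on Python 3, the same steps might be sketched as follows. This is an untested-in-production adaptation, not part of the original answer, and the name fixurl_py3 is made up here. Like the original, it splits host and port on the first ":", so it assumes no bracketed IPv6 hosts.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def fixurl_py3(url):
    """Return an ASCII URL: IDNA-encoded host, percent-encoded other parts.

    fixurl_py3 is a hypothetical name; the logic mirrors fixurl above.
    """
    # accept UTF-8 bytes as well as str
    if isinstance(url, bytes):
        url = url.decode("utf-8")

    parsed = urlsplit(url)

    # divide the netloc further so only the host is IDNA-encoded
    userpass, at, hostport = parsed.netloc.rpartition("@")
    user, colon1, pass_ = userpass.partition(":")
    host, colon2, port = hostport.partition(":")

    user = quote(user)
    pass_ = quote(pass_)
    host = host.encode("idna").decode("ascii")

    # unquote first so existing escapes such as %2F are not double-encoded
    path = "/".join(
        quote(unquote(segment), safe="") for segment in parsed.path.split("/")
    )
    query = quote(unquote(parsed.query), safe="=&?/")
    fragment = quote(unquote(parsed.fragment))

    # put it back together
    netloc = "".join((user, colon1, pass_, at, host, colon2, port))
    return urlunsplit((parsed.scheme, netloc, path, query, fragment))


print(fixurl_py3("http://➡.ws/♥"))
print(fixurl_py3("http://➡.ws/♥/%2F"))
```

With the same inputs as the Python 2 version, this should give the same four outputs listed above. In Python 3, quote() encodes str input as UTF-8 by default, so the explicit .encode('utf8') calls are no longer needed.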
Edit: switched from urlparse/urlunparse to urlsplit/urlunsplit.