
Best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?

I'm wondering what's the best way -- or if there's a simple way with the standard library -- to convert a URL with Unicode chars in the domain name and path to the equivalent ASCII URL, encoded with domain as IDNA and the path %-encoded, as per RFC 3986.

I get from the user a URL in UTF-8. So if they've typed in http://➡.ws/♥ I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. And what I want out is the ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.
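For reference, the two encodings involved can be tried separately. This is only a minimal sketch in Python 3 (the question itself uses Python 2), assuming the built-in `idna` codec (RFC 3490) and `urllib.parse.quote`; the variable names are illustrative.

```python
from urllib.parse import quote

# Host part: the built-in "idna" codec turns a Unicode hostname into
# its ASCII (Punycode, "xn--" prefixed) form.
ascii_host = "➡.ws".encode("idna").decode("ascii")

# Path part: quote() percent-encodes the UTF-8 bytes of each non-ASCII
# character and leaves "/" alone by default.
ascii_path = quote("/♥")

print("http://" + ascii_host + ascii_path)  # http://xn--hgi.ws/%E2%99%A5
```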

What I do at the moment is split the URL up into parts via a regex, and then manually IDNA-encode the domain, and separately encode the path and query string with different urllib.quote() calls.

# url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                 r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
if not match:
    raise BadURLException(url)
protocol, domain, port, path, query = match.groups()

try:
    domain = unicode(domain, 'utf-8')
except UnicodeDecodeError:
    return ''  # bad UTF-8 chars in domain
domain = domain.encode('idna')

if port is None:
    port = ''

path = urllib.quote(path)

if query is None:
    query = ''
else:
    query = urllib.quote(query, safe='=&?/')

url = protocol + '://' + domain + port + path + query
# url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'

Is this correct? Any better suggestions? Is there a simple standard-library function to do this?

Asked by Ben Hoyt, Apr 29 '09 21:04



1 Answer

Code:

import urlparse, urllib

def fixurl(url):
    # turn string into unicode
    if not isinstance(url,unicode):
        url = url.decode('utf8')

    # parse it
    parsed = urlparse.urlsplit(url)

    # divide the netloc further
    userpass,at,hostport = parsed.netloc.rpartition('@')
    user,colon1,pass_ = userpass.partition(':')
    host,colon2,port = hostport.partition(':')

    # encode each component
    scheme = parsed.scheme.encode('utf8')
    user = urllib.quote(user.encode('utf8'))
    colon1 = colon1.encode('utf8')
    pass_ = urllib.quote(pass_.encode('utf8'))
    at = at.encode('utf8')
    host = host.encode('idna')
    colon2 = colon2.encode('utf8')
    port = port.encode('utf8')
    path = '/'.join(  # could be encoded slashes!
        urllib.quote(urllib.unquote(pce).encode('utf8'),'')
        for pce in parsed.path.split('/')
    )
    query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'),'=&?/')
    fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))

    # put it back together
    netloc = ''.join((user,colon1,pass_,at,host,colon2,port))
    return urlparse.urlunsplit((scheme,netloc,path,query,fragment))

print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
print fixurl(u'http://Åsa:abc123@➡.ws:81/admin')
print fixurl(u'http://➡.ws/admin')

Output:

http://xn--hgi.ws/%E2%99%A5
http://xn--hgi.ws/%E2%99%A5/%2F
http://%C3%85sa:abc123@xn--hgi.ws:81/admin
http://xn--hgi.ws/admin
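The code above targets Python 2 (`urlparse`, `unicode`, the `print` statement). A rough Python 3 equivalent might look like the sketch below. It is not part of the original answer: it assumes the same functions now live in `urllib.parse`, relies on `quote()` encoding `str` input as UTF-8 by default, and, like the code above, does not handle IPv6 literals in the host.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def fixurl(url):
    """Return an ASCII-only version of url (IDNA host, %-encoded rest)."""
    # Accept UTF-8 bytes as well as str.
    if isinstance(url, bytes):
        url = url.decode("utf-8")

    parsed = urlsplit(url)

    # Divide the netloc further; rpartition leaves user/password empty
    # when there is no "@".
    userpass, at, hostport = parsed.netloc.rpartition("@")
    user, colon1, password = userpass.partition(":")
    host, colon2, port = hostport.partition(":")

    # Encode each component.
    user = quote(user)
    password = quote(password)
    host = host.encode("idna").decode("ascii")
    path = "/".join(
        # Unquote first so existing escapes such as %2F are not doubled.
        quote(unquote(piece), safe="")
        for piece in parsed.path.split("/")
    )
    query = quote(unquote(parsed.query), safe="=&?/")
    fragment = quote(unquote(parsed.fragment))

    # Put it back together.
    netloc = "".join((user, colon1, password, at, host, colon2, port))
    return urlunsplit((parsed.scheme, netloc, path, query, fragment))


print(fixurl("http://➡.ws/♥"))       # http://xn--hgi.ws/%E2%99%A5
print(fixurl("http://➡.ws/♥/%2F"))   # http://xn--hgi.ws/%E2%99%A5/%2F
```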

Read more:

  • urllib.quote()
  • urlparse.urlparse()
  • urlparse.urlunparse()
  • urlparse.urlsplit()
  • urlparse.urlunsplit()

Edits:

  • Fixed the case of already quoted characters in the string.
  • Changed urlparse/urlunparse to urlsplit/urlunsplit.
  • Don't encode user and port information with the hostname. (Thanks Jehiah)
  • When "@" is missing, don't treat the host/port as user/pass! (Thanks hupf)
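Two of these fixes are easy to see in isolation. The snippet below is an illustrative sketch in Python 3, not taken from the answer; it only shows the standard-library behaviour that the first and last bullets appear to rely on.

```python
from urllib.parse import quote, unquote

# Already-quoted input: quoting it directly escapes the "%" signs again,
# while unquoting first and then quoting leaves it unchanged.
already_quoted = "%E2%99%A5"
doubled = quote(already_quoted, safe="")               # '%25E2%2599%25A5'
round_trip = quote(unquote(already_quoted), safe="")   # '%E2%99%A5'

# Missing "@": rpartition puts the whole netloc in the last slot
# (host:port) and leaves the user/password slot empty...
netloc = "xn--hgi.ws:81"
userpass, at, hostport = netloc.rpartition("@")        # ('', '', 'xn--hgi.ws:81')

# ...whereas partition would put it in the first slot, so the host and
# port would be mistaken for user and password.
wrong_userpass, _, wrong_hostport = netloc.partition("@")

print(doubled, round_trip)
print(repr(userpass), repr(hostport))
print(repr(wrong_userpass), repr(wrong_hostport))
```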

Answered by Markus Jarderot, Sep 20 '22 12:09