Best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?

What is the best way (ideally a simple one using only the standard library) to convert a URL with Unicode characters in the domain name and path to the equivalent ASCII URL, with the domain encoded as IDNA and the path percent-encoded as per RFC 3986?

I get a UTF-8 encoded URL from the user. So if they've typed in http://➡.ws/♥, I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. What I want out is the ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.
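To make the target concrete, here is a minimal sketch of the two encodings applied separately. It uses Python 3 syntax, which is an assumption on my part (the code in this post is Python 2), and the variable names are purely illustrative.

```python
from urllib.parse import quote

# Host: the built-in 'idna' codec converts each non-ASCII label to Punycode.
ascii_host = '➡.ws'.encode('idna').decode('ascii')

# Path: quote() percent-encodes the UTF-8 bytes and leaves '/' alone by default.
ascii_path = quote('/♥')

print('http://' + ascii_host + ascii_path)  # http://xn--hgi.ws/%E2%99%A5
```

The host and the path need different treatment, which is why the URL has to be split into components before encoding.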

What I do at the moment is split the URL up into parts via a regex, and then manually IDNA-encode the domain, and separately encode the path and query string with different urllib.quote() calls.

    # url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
    match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                     r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
    if not match:
        raise BadURLException(url)
    protocol, domain, port, path, query = match.groups()

    try:
        domain = unicode(domain, 'utf-8')
    except UnicodeDecodeError:
        return ''  # bad UTF-8 chars in domain
    domain = domain.encode('idna')

    if port is None:
        port = ''

    path = urllib.quote(path)

    if query is None:
        query = ''
    else:
        query = urllib.quote(query, safe='=&?/')

    url = protocol + '://' + domain + port + path + query
    # url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'

Is this correct? Any better suggestions? Is there a simple standard-library function to do this?

430

asked Apr 29 '09 21:04

Ben Hoyt

1 Answer

Code:

    import urlparse, urllib

    def fixurl(url):
        # turn string into unicode
        if not isinstance(url,unicode):
            url = url.decode('utf8')

        # parse it
        parsed = urlparse.urlsplit(url)

        # divide the netloc further
        userpass,at,hostport = parsed.netloc.rpartition('@')
        user,colon1,pass_ = userpass.partition(':')
        host,colon2,port = hostport.partition(':')

        # encode each component
        scheme = parsed.scheme.encode('utf8')
        user = urllib.quote(user.encode('utf8'))
        colon1 = colon1.encode('utf8')
        pass_ = urllib.quote(pass_.encode('utf8'))
        at = at.encode('utf8')
        host = host.encode('idna')
        colon2 = colon2.encode('utf8')
        port = port.encode('utf8')
        path = '/'.join(  # could be encoded slashes!
            urllib.quote(urllib.unquote(pce).encode('utf8'),'')
            for pce in parsed.path.split('/')
        )
        query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'),'=&?/')
        fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))

        # put it back together
        netloc = ''.join((user,colon1,pass_,at,host,colon2,port))
        return urlparse.urlunsplit((scheme,netloc,path,query,fragment))

    print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
    print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
    print fixurl(u'http://Åsa:abc123@➡.ws:81/admin')
    print fixurl(u'http://➡.ws/admin')

Output:

http://xn--hgi.ws/%E2%99%A5
http://xn--hgi.ws/%E2%99%A5/%2F
http://%C3%85sa:abc123@xn--hgi.ws:81/admin
http://xn--hgi.ws/admin
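The code in this answer is Python 2 (`unicode`, `urllib.quote`, `urlparse`). For readers on Python 3, the following is a rough port, offered as a sketch rather than a vetted implementation. It assumes the same functions are available under `urllib.parse` and that `quote()` is given `str` input, which it encodes as UTF-8 by default. Like the original, it splits host and port on the first ':' and so does not handle IPv6 literals.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def fixurl(url):
    """Return an ASCII-only form of url (Python 3 sketch of the approach above)."""
    # Accept UTF-8 bytes as well as str.
    if isinstance(url, bytes):
        url = url.decode('utf-8')

    parsed = urlsplit(url)

    # Divide the netloc into user, password, host and port.
    userpass, at, hostport = parsed.netloc.rpartition('@')
    user, colon1, pass_ = userpass.partition(':')
    host, colon2, port = hostport.partition(':')

    # Encode each component with the scheme appropriate to it.
    user = quote(user)
    pass_ = quote(pass_)
    host = host.encode('idna').decode('ascii')
    path = '/'.join(
        # Unquote first so already-escaped input is not escaped twice.
        quote(unquote(segment), safe='')
        for segment in parsed.path.split('/')
    )
    query = quote(unquote(parsed.query), safe='=&?/')
    fragment = quote(unquote(parsed.fragment))

    # Put it back together.
    netloc = ''.join((user, colon1, pass_, at, host, colon2, port))
    return urlunsplit((parsed.scheme, netloc, path, query, fragment))


print(fixurl('http://➡.ws/♥'))                       # http://xn--hgi.ws/%E2%99%A5
print(fixurl('http://➡.ws/♥/%2F'))                   # http://xn--hgi.ws/%E2%99%A5/%2F
print(fixurl('http://\u00c5sa:abc123@➡.ws:81/admin'))  # http://%C3%85sa:abc123@xn--hgi.ws:81/admin
```

The main difference from the Python 2 version is that the explicit `.encode('utf8')` calls disappear, because Python 3 strings are already Unicode.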

See also:

urllib.quote()
urlparse.urlparse()
urlparse.urlunparse()
urlparse.urlsplit()
urlparse.urlunsplit()

Edits:

Fixed the case of already quoted characters in the string.
Changed urlparse/urlunparse to urlsplit/urlunsplit.
Don't encode user and port information with the hostname. (Thanks Jehiah)
When "@" is missing, don't treat the host/port as user/pass! (Thanks hupf)
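Two of the edits listed above are easy to check in isolation. The snippet below is a small illustration in Python 3 syntax (an assumption, since the answer itself targets Python 2); `example.com:81` and the variable names are made up for the demonstration.

```python
from urllib.parse import quote, unquote

# Missing "@": rpartition() leaves the whole string in the LAST slot,
# so host and port are not mistaken for user and password.
netloc = 'example.com:81'
userpass, at, hostport = netloc.rpartition('@')
print((userpass, at, hostport))   # ('', '', 'example.com:81')
print(netloc.partition('@'))      # ('example.com:81', '', '')

# Already quoted characters: quoting '%2F' directly escapes the '%' again,
# while unquoting first and then quoting reproduces the original escape.
double_encoded = quote('%2F', safe='')
round_tripped = quote(unquote('%2F'), safe='')
print(double_encoded)             # %252F
print(round_tripped)              # %2F
```

This is why the answer uses `rpartition('@')` for the netloc and wraps each path segment in `unquote()` before calling `quote()`.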

185

answered Sep 20 '22 12:09

Markus Jarderot

Tags: python, url, unicode, utf-8