
How to fetch a non-ascii url with urlopen?

I need to fetch data from a URL with non-ascii characters but urllib2.urlopen refuses to open the resource and raises:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128) 

I know the URL is not standards compliant, but I have no way to change it.

How can I access a resource pointed to by a URL containing non-ascii characters using Python?

edit: In other words, can urlopen open a URL like the following, and if so, how:

http://example.org/Ñöñ-ÅŞÇİİ/ 

onurmatik asked Dec 08 '10 16:12




1 Answer

Strictly speaking URIs can't contain non-ASCII characters; what you have there is an IRI.

To convert an IRI to a plain ASCII URI:

  • non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;

  • non-ASCII characters in the path and most of the other parts of the address have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.

So:

    import re, urlparse

    def urlEncodeNonAscii(b):
        return re.sub('[\x80-\xFF]', lambda c: '%%%02x' % ord(c.group(0)), b)

    def iriToUri(iri):
        parts = urlparse.urlparse(iri)
        return urlparse.urlunparse(
            part.encode('idna') if parti == 1 else urlEncodeNonAscii(part.encode('utf-8'))
            for parti, part in enumerate(parts)
        )

    >>> iriToUri(u'http://www.a\u0131b.com/a\u0131b')
    'http://www.xn--ab-hpa.com/a%c4%b1b'

(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass@ prefix or :port suffix on the hostname. Only the hostname part should be IDNA encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)
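
One way to address that caveat is to split the network location by hand, so that only the hostname goes through IDNA. The following is a minimal sketch, assuming Python 3 (where urlparse and urllib.quote live in urllib.parse) rather than the Python 2 modules used above. The names quote_non_ascii and iri_to_uri are hypothetical helpers, not part of any library, and the sketch has not been checked against every edge case of the IRI specification.

```python
import re
from urllib.parse import quote, urlsplit, urlunsplit

# Matches runs of characters outside the ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7f]+")


def quote_non_ascii(text):
    """Percent-encode only the non-ASCII runs of text (as UTF-8).

    ASCII characters, including existing %-escapes and reserved
    characters such as "?" or "&", are left untouched.
    """
    return _NON_ASCII.sub(lambda m: quote(m.group(0)), text)


def iri_to_uri(iri):
    """Convert an IRI string to an ASCII-only URI (hypothetical helper)."""
    parts = urlsplit(iri)

    # Everything before the last "@" is userinfo; the rest is host[:port].
    userinfo, at, hostport = parts.netloc.rpartition("@")

    if hostport.startswith("["):
        # Assumption: a bracketed IPv6 literal is already ASCII, so keep it.
        host, port_suffix = hostport, ""
    else:
        host, colon, port = hostport.partition(":")
        port_suffix = colon + port
        # Only the hostname itself is IDNA-encoded.
        host = host.encode("idna").decode("ascii")

    netloc = quote_non_ascii(userinfo) + at + host + port_suffix
    return urlunsplit((
        parts.scheme,
        netloc,
        quote_non_ascii(parts.path),
        quote_non_ascii(parts.query),
        quote_non_ascii(parts.fragment),
    ))


print(iri_to_uri("http://www.a\u0131b.com/a\u0131b"))
# http://www.xn--ab-hpa.com/a%C4%B1b
```

The result can then be passed to urllib.request.urlopen. The hex digits come out in upper case here because that is what urllib.parse.quote emits; they are equivalent to the lower-case escapes shown above.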

bobince answered Oct 08 '22 12:10