Using Python 2.5.2 on Debian Linux, I'm trying to get the content from a Spanish URL that contains the Spanish character 'í':
import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url).read()
I'm getting this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)
Before passing the URL to urllib, I tried this:
url = urllib.quote(url)
and this:
url = url.encode('UTF-8')
but they didn't work.
Can you tell me what I am doing wrong?
This works for me:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The encoding declaration must be on line 1 or 2, see: http://www.python.org/dev/peps/pep-0263/
import urllib
url = u'http://example.com/índice.html'
content = urllib.urlopen(url.encode("UTF-8")).read()
I'm having a similar problem right now. I'm trying to download images. I retrieve the URLs from the server in a JSON file. Some of the image URLs contain non-ASCII characters. This throws an error:
import os
import urllib.request

# product and product_path come from the JSON data loaded earlier
for image in product["images"]:
    filename = os.path.basename(image)
    filepath = product_path + "/" + filename
    urllib.request.urlretrieve(image, filepath)  # error!
UnicodeEncodeError: 'ascii' codec can't encode character '\xc7' in position ...
I've tried using .encode("UTF-8"), but can't say it helped:
# coding=UTF-8
import urllib.request
url = u"http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = url.encode("UTF-8")  # url is now bytes, which urlretrieve does not accept
urllib.request.urlretrieve(url, r"D:\image-1.jpg")
This just throws another error:
TypeError: cannot use a string pattern on a bytes-like object
Then I gave urllib.parse.quote(url) a go:
import urllib.parse
import urllib.request
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = urllib.parse.quote(url)  # quotes the whole URL, including the scheme
urllib.request.urlretrieve(url, r"D:\image-1.jpg")
and again, this throws another error:
ValueError: unknown url type: 'http%3A//example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png'
The : in "http://..." also got escaped, and I think this is the cause of the problem.
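The escaping can be reproduced without touching the network. This is only a small sketch using the example URL from above: `urllib.parse.quote` treats just "/" as safe by default, so the ":" after the scheme is escaped unless it is listed in `safe` as well.

```python
from urllib.parse import quote

url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"

# By default only "/" is left unescaped, so the ":" after the scheme becomes %3A.
default_quoted = quote(url)

# Listing ":" as safe as well keeps "http://" intact.
scheme_kept = quote(url, safe=":/")

print(default_quoted)
print(scheme_kept)
```

Passing `safe=":/"` is enough for a plain URL like this one, but it would still escape characters such as "?", "=" and "&" if the URL had a query string.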
So, I've figured out a workaround. I just quote/escape the path, not the whole URL.
import urllib.request
import urllib.parse
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = urllib.parse.urlparse(url)
url = url.scheme + "://" + url.netloc + urllib.parse.quote(url.path)
urllib.request.urlretrieve(url, r"D:\image-1.jpg")
This is what the URL looks like: "http://example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png", and now I can download the image.
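One caveat with rebuilding the URL from scheme, netloc and path only: a query string or fragment would be dropped. Here is a rough sketch of a helper that keeps them. The name `quote_url` and the `?size=büyük` query are made up for illustration, and the host name is assumed to be plain ASCII already (it is passed through untouched).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url(url):
    """Percent-encode non-ASCII characters in path, query and fragment.

    Hypothetical helper: the host is assumed to be ASCII and is left as-is.
    """
    parts = urlsplit(url)
    # "%" is treated as safe so that already-escaped sequences are not escaped twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


# Made-up URL with a non-ASCII path and a non-ASCII query value.
quoted = quote_url("http://example.com/wp-content/uploads/2018/09/İMAGE-1.png?size=büyük")
print(quoted)
```

The result can then be passed to `urllib.request.urlretrieve` in place of the hand-built string. Because "%" is kept as safe, running the helper on an already quoted URL leaves it unchanged.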
Per the applicable standard, RFC 1738, URLs can only contain ASCII characters. Good explanation here, and I quote:
"...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."
As the URLs I've given explain, this probably means you'll have to replace that "lowercase i with acute accent" with a percent-encoded form: '%ED' if the server expects Latin-1, or '%C3%AD' if it expects UTF-8.
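Which percent-encoded form "í" takes depends on the character encoding applied before escaping. A minimal sketch in Python 3 syntax (the question itself uses Python 2, where the equivalent would be `urllib.quote` on an already encoded byte string):

```python
from urllib.parse import quote, unquote

# quote() encodes to UTF-8 unless told otherwise: "í" becomes two bytes.
utf8_form = quote("í")

# With Latin-1, "í" is the single byte 0xED.
latin1_form = quote("í", encoding="latin-1")

print(utf8_form, latin1_form)

# Decoding has to use the same encoding that was used for quoting.
round_trip = unquote(latin1_form, encoding="latin-1")
```

Which of the two the server accepts depends on how it decodes request paths, so it may be worth trying both if one returns a 404.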
Encoding the URL as UTF-8 should have worked. I wonder if your source file is properly encoded, and whether the interpreter knows it. If your Python source file is saved as UTF-8, for example, then you should have
# coding=UTF-8
as the first or second line.
import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url.encode('utf-8')).read()
works for me.
Edit: also, be aware that Unicode text in an interactive Python session (whether through IDLE or a console) is fraught with encoding-related difficulty. In those cases, you should use Unicode escapes in your string literals (like \u00ED in your case).
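To illustrate the escape approach, here is a small sketch using the URL from the question. Writing the character as `\u00ed` keeps the source pure ASCII, so the result does not depend on the file or console encoding.

```python
# The u prefix is required in Python 2 and accepted in Python 3.3+.
escaped = u"http://mydomain.es/\u00edndice.html"

# Encoding to UTF-8 yields the two bytes 0xC3 0xAD for the accented character.
encoded = escaped.encode("utf-8")

print(repr(encoded))
```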
It works for me. Make sure you're using a fairly recent version of Python, and your file encoding is correct. Here's my code:
# -*- coding: utf-8 -*-
import urllib
url = u'http://mydomain.es/índice.html'
url = url.encode('utf-8')
content = urllib.urlopen(url).read()
(mydomain.es does not exist, so the DNS lookup fails, but there are no unicode issues to that point.)