# -*- coding: utf-8 -*-
# Python3
import urllib
import urllib.request as url_req
opener = url_req.build_opener()
url='http://zh.wikipedia.org/wiki/'+"毛泽东"
opener.open(url).read()
# opener.open(url.encode("utf-8")).read()
# # doesn't work either
When I run it, it complains that:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)
But I can't use .encode() either, because then it complains:
Traceback (most recent call last):
  File "t.py", line 8, in <module>
    opener.open(url.encode("utf-8")).read()
  File "/usr/local/Cellar/python3/3.2.2/lib/python3.2/urllib/request.py", line 360, in open
    req.timeout = timeout
AttributeError: 'bytes' object has no attribute 'timeout'
Does anyone know how to deal with that?
Unicode contains many characters that have similar appearance to other characters. Allowing the full range of Unicode into a URL means that characters which look similar—or even identical to—other characters could be used to spoof users.
In Python, the built-in functions chr() and ord() convert between Unicode code points and characters. A character can also be written in a string literal as a hexadecimal Unicode code point, using the \x, \u, or \U escapes.
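As a small illustration of the point above, the following sketch round-trips one character from the question's URL through ord() and chr() and writes it with each escape form. The variable names are made up for this example.

```python
# Sketch: converting between characters and Unicode code points.
ch = "毛"
code_point = ord(ch)             # character -> integer code point
round_trip = chr(code_point)     # integer code point -> character

# The same idea written with escape sequences in string literals.
from_u_escape = "\u6bdb"         # \u takes exactly 4 hex digits
from_U_escape = "\U00006bdb"     # \U takes exactly 8 hex digits
from_x_escape = "\xe9"           # \x takes 2 hex digits, so it only reaches U+0000..U+00FF

print(hex(code_point), round_trip == ch)
print(from_u_escape == ch, from_U_escape == ch, from_x_escape == "é")
```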
In Python 3, the default string type, str, holds Unicode text (what Python 2 wrote as a u'' string); you can think of it as human-readable characters. A str can be encoded to a byte string (bytes, written b'...'), and a byte string can be decoded back to a str.
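A minimal sketch of that round trip, using the page title from the question. It also shows that the ASCII codec cannot represent this text, which is the same class of error the question reports. The variable names are illustrative only.

```python
# Sketch: encoding a str to bytes and decoding it back.
text = "毛泽东"
data = text.encode("utf-8")       # str -> bytes
restored = data.decode("utf-8")   # bytes -> str

# Three characters become nine bytes in UTF-8.
print(len(text), len(data), restored == text)

# ASCII has no representation for these characters.
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
    print("ASCII cannot represent this text:", exc.reason)
```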
You can use urllib.parse.quote() to percent-encode the path section of the URL.
#!/usr/bin/env python3
from urllib.parse import quote
from urllib.request import urlopen
url = 'http://zh.wikipedia.org/wiki/' + quote("毛泽东")
content = urlopen(url).read()
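If you already have the whole URL as one string, calling quote() on all of it would also escape the ":" after the scheme. One possible approach is to split the URL, quote only the path, and put it back together. This is a sketch: quote_url_path is a hypothetical helper name, not part of urllib, and it assumes the host name is ASCII and that any query string is already encoded.

```python
# Sketch: percent-encode only the path component of a full URL.
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Hypothetical helper: return url with its path percent-encoded as UTF-8."""
    parts = urlsplit(url)
    # safe="/%" keeps path separators and leaves existing %XX escapes alone.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=encoded_path))


encoded = quote_url_path("http://zh.wikipedia.org/wiki/毛泽东")
print(encoded)
```

The result is plain ASCII, so it can be passed to urlopen() or opener.open() as a str without calling .encode() on it.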
The fantastic requests library does this for you out of the box:
>>> url='http://zh.wikipedia.org/wiki/'+"毛泽东"
>>> import requests
>>> r = requests.get(url)
>>> len(r.content)
818747