I tried parsing a web page using urllib.request's urlopen() method, like:
from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()
However, the last line returned the result in bytes.
So I tried decoding it, like:
html = urlopen(req).read().decode("utf-8")
However, this error occurred:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.
After some research, I found one related answer, which parses the charset to decide how to decode. However, the page doesn't return the charset, and when I checked it in Chrome Web Inspector, the following line was written in its HTML head:
<meta charset="utf-8">
So why can I not decode it with utf-8? And how can I parse the web page successfully?
The web site URL is http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi/slideshow/collection#2, where I want to save the image to my disk.
Note that I use Python 3.5.1. Also, everything I wrote above has worked well in my other scraping programs.
The data returned by urlopen() or urlretrieve() is the raw data returned by the server. This may be binary data (such as an image), plain text or (for example) HTML. The HTTP protocol provides type information in the reply header, which can be inspected by looking at the Content-Type header.
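As a rough illustration of inspecting those headers, the sketch below reads the media type, charset, and Content-Encoding from a headers object. The helper name describe_headers and the sample header values are made up for this example; with a live response you would pass urlopen(req).headers instead of the stand-in.

```python
from email import message_from_string


def describe_headers(headers):
    """Return (media type, charset, content encoding) for a response's headers.

    `headers` is expected to behave like email.message.Message, which is what
    urlopen(...).headers provides.
    """
    return (
        headers.get_content_type(),       # e.g. "text/html"
        headers.get_content_charset(),    # None if the server sent no charset
        headers.get("Content-Encoding"),  # e.g. "gzip"; None if absent
    )


# Stand-in headers (assumed values) so the example runs without a network call.
sample = message_from_string(
    "Content-Type: text/html; charset=UTF-8\nContent-Encoding: gzip\n\n"
)
info = describe_headers(sample)
print(info)
```

A Content-Encoding of gzip means the body must be decompressed before it can be decoded as text, regardless of the charset.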
urllib.request is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function. This is capable of fetching URLs using a variety of different protocols.
This function always returns an object which can work as a context manager and has the properties url, headers, and status. See urllib.
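A small sketch of using the returned object as a context manager follows. It uses a data: URL purely as a stand-in so that it runs offline; a real http:// URL would be used the same way.

```python
from urllib.request import urlopen

# A data: URL is used here only as an offline stand-in for a real page.
demo_url = "data:text/plain;charset=utf-8,hello%20world"

with urlopen(demo_url) as resp:
    final_url = resp.url                          # URL that was actually fetched
    charset = resp.headers.get_content_charset()  # taken from Content-Type
    body = resp.read()                            # always bytes

# Fall back to UTF-8 when the server does not declare a charset.
text = body.decode(charset or "utf-8")
print(final_url, charset, text)
```

The with block closes the response when it exits, so the body should be read inside it.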
The Python 3 standard library has a new urllib which is a merged/refactored/rewritten version of the older modules. urllib3 is a third-party package (i.e., not in CPython's standard library).
The content is compressed with gzip. You need to decompress it:
import gzip
from urllib.request import Request, urlopen
req = Request(url)
html = gzip.decompress(urlopen(req).read()).decode('utf-8')
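If the same code has to handle both compressed and uncompressed pages, one possible approach is to check for the gzip magic bytes (0x1f 0x8b) before decompressing. The 0x8b at position 1 in the error message above is the second of those bytes. This is only a sketch; body_to_text is a hypothetical helper name, and the charset default is an assumption.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def body_to_text(raw, charset="utf-8"):
    """Decode response bytes to str, decompressing first if they look gzipped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Offline demonstration with locally built payloads.
compressed = gzip.compress("<html>héllo</html>".encode("utf-8"))
print(body_to_text(compressed))
print(body_to_text(b"<html>plain</html>"))
```

With a live request, the call would look like body_to_text(urlopen(req).read()).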
If you use requests, it will uncompress automatically for you:
import requests
html = requests.get(url).text # => str, not bytes