
urllib.request.urlopen returns bytes, but I cannot decode them [duplicate]

I tried to fetch and parse a web page using urllib.request's urlopen() function:

from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()

However, the last line returns the result as bytes.

So I tried decoding it:

html = urlopen(req).read().decode("utf-8")

However, this raised an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.

After some research, I found a related answer that reads the charset from the response to choose the codec. However, this page's response doesn't declare a charset, and when I inspected the page in Chrome's Web Inspector, its head contained the following line:

<meta charset="utf-8">

So why can I not decode it with utf-8? And how can I parse the web page successfully?

The page URL is http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi/slideshow/collection#2, from which I want to save the images to disk.

Note that I use Python 3.5.1. The same code has worked well in my other scraping programs.

Blaszard asked Feb 01 '16 02:02




1 Answer

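The content is compressed with gzip. The byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), which is why decoding the raw bytes as UTF-8 fails. You need to decompress it first: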

import gzip
from urllib.request import Request, urlopen

req = Request(url)
html = gzip.decompress(urlopen(req).read()).decode('utf-8')
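
The error can be reproduced offline, which may help confirm the diagnosis. The following is a minimal sketch: the sample HTML is made up for illustration and no network request is involved. It checks for the gzip magic bytes, shows that a direct UTF-8 decode fails on the second byte, and then decompresses before decoding.

```python
import gzip

# Hypothetical page body, compressed the way a server might send it.
page = '<meta charset="utf-8"><p>héllo</p>'
body = gzip.compress(page.encode("utf-8"))

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
is_gzip = body[:2] == b"\x1f\x8b"

# Decoding the compressed bytes directly fails on the second byte (0x8b),
# because 0x8b cannot start a UTF-8 sequence.
try:
    body.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decompressing first recovers the original text.
recovered = gzip.decompress(body).decode("utf-8")
```

Here `error.start` is the byte offset reported in the traceback, so it lines up with "position 1" in the message from the question.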

If you use the third-party requests library, it decompresses the response automatically:

import requests
html = requests.get(url).text  # => str, not bytes
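
For code that stays with urllib, calling gzip.decompress() unconditionally would fail on servers that send uncompressed bodies. One possible approach, sketched below, is to decompress only when the Content-Encoding header or the leading magic bytes indicate gzip, and to use the charset from the Content-Type header when one is declared. The names decode_body and fetch_text are hypothetical helpers, not part of urllib, and the UTF-8 fallback is an assumption.

```python
import gzip
from urllib.request import Request, urlopen


def decode_body(raw, content_encoding=None, charset=None):
    """Return text from raw response bytes, gunzipping only when needed."""
    # Trust the header if present, otherwise fall back to the gzip magic bytes.
    if (content_encoding or "").lower() == "gzip" or raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    # Assumption: default to UTF-8 when the server declares no charset.
    return raw.decode(charset or "utf-8")


def fetch_text(url):
    """Fetch url with urllib and return its body as str."""
    with urlopen(Request(url)) as response:
        return decode_body(
            response.read(),
            response.headers.get("Content-Encoding"),
            response.headers.get_content_charset(),
        )
```

With these helpers, html = fetch_text(url) would return a str for both compressed and uncompressed responses. Keeping the decoding logic in a separate function from the network call makes it possible to check it against sample bytes without fetching anything.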

falsetru answered Nov 15 '22 06:11