urllib.request.urlopen returns bytes, but I cannot decode it [duplicate]

Tags: python, python-3.x, urllib, decode, urlopen

I tried parsing a web page using the urlopen() function from urllib.request, like:

from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()

However, the last line returned the result in bytes.

So I tried decoding it, like:

html = urlopen(req).read().decode("utf-8")

However, the error occurred:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.

With some research, I found one related answer, which reads the charset from the response to decide which encoding to decode with. However, the page doesn't return a charset, and when I checked it in Chrome Web Inspector, the following line was in the document head:

<meta charset="utf-8">

So why can I not decode it with utf-8? And how can I parse the web page successfully?

The web site URL is http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi/slideshow/collection#2, and I want to save the images from it to my disk.

Note that I use Python 3.5.1. The same code shown above has worked well in my other scraping programs.


asked Feb 01 '16 02:02

Blaszard

1 Answer

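The content is compressed with gzip: the byte 0x8b at position 1 in the error message is the second byte of the gzip magic number (0x1f 0x8b). You need to decompress it before decoding: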

import gzip
from urllib.request import Request, urlopen

req = Request(url)
html = gzip.decompress(urlopen(req).read()).decode('utf-8')
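The one-liner above assumes the body is always gzip-compressed and always UTF-8. As a rough sketch only, a small helper could decompress the body only when it looks like gzip and fall back to UTF-8 when no charset is declared. The name `body_to_text` and its parameters are hypothetical, not part of urllib:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def body_to_text(raw, content_encoding=None, charset=None):
    """Hypothetical helper: gunzip the body if needed, then decode it to str."""
    declared_gzip = (content_encoding or "").strip().lower() == "gzip"
    # Trust the Content-Encoding header, but also sniff the magic number
    # in case the header is missing.
    if declared_gzip or raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Assumption: fall back to UTF-8 when the response declares no charset.
    return raw.decode(charset or "utf-8")


# Offline demonstration with a locally compressed payload.
sample = '<meta charset="utf-8">'
print(body_to_text(gzip.compress(sample.encode("utf-8")), "gzip"))
print(body_to_text(sample.encode("utf-8")))
```

With a real response this might be called as `body_to_text(resp.read(), resp.headers.get("Content-Encoding"), resp.headers.get_content_charset())`, where `resp` is the object returned by `urlopen(req)`.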

If you use requests, it will decompress the response automatically for you:

import requests
html = requests.get(url).text  # => str, not bytes
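To check the gzip diagnosis without touching the network, the error from the question can be reproduced offline. This is a minimal sketch in which the `page` string is a stand-in for the real HTML, not the actual page content:

```python
import gzip

# Stand-in for the real page body (an assumption for illustration).
page = '<meta charset="utf-8">'
compressed = gzip.compress(page.encode("utf-8"))

# gzip streams always begin with the magic number 0x1f 0x8b.
magic = compressed[:2]

# Decoding the still-compressed bytes fails on the second byte, 0x8b,
# which is not a valid UTF-8 start byte.
try:
    compressed.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decompressing first makes the decode succeed.
restored = gzip.decompress(compressed).decode("utf-8")

print(magic)
print(error_message)
print(restored == page)
```

The printed error message names byte 0x8b at position 1, matching the traceback in the question, which suggests the server sent a gzip-compressed body.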

answered Nov 15 '22 06:11

falsetru
