
Why can't I decode this UTF-8 page?

Howdy folks,

I'm new to getting data from the web using python. I'd like to have the source code of this page in a string: https://projects.fivethirtyeight.com/2018-nba-predictions/

The following code has worked for other pages (such as https://www.basketball-reference.com/boxscores/201712090ATL.html):

import urllib.request
file = urllib.request.urlopen(webAddress)
data = file.read()
file.close()
dataString = data.decode(encoding='UTF-8')

I'd expect dataString to be a string of HTML (see below for what I expect in this specific case):

<!DOCTYPE html><html lang="en"><head><meta property="article:modified_time" etc etc

Instead, for the 538 website, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

My research has suggested that the problem is that my file isn't actually encoded using UTF-8, but both the page's charset and Beautiful Soup's UnicodeDammit() claim it's UTF-8 (the second might be because of the first). chardet.detect() doesn't suggest any encoding. I've tried substituting each of the following for 'UTF-8' in the encoding parameter of decode(), to no avail:

ISO-8859-1

latin-1

Windows-1252

Perhaps worth mentioning is that the byte array data doesn't look like I'd expect it to. Here's data[:10] from a working URL:

b'\n<!DOCTYPE'

Here's data[:10] from the 538 site:

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

What's up?

Andy Pollino asked Jan 30 '23 01:01


1 Answer

The server sent you gzip-compressed data. This is fairly uncommon, because urllib does not ask for compressed content by default, so servers generally play it safe and send the data uncompressed.

Still, the content-encoding field of the response is set, so you can tell that your page is indeed gzip-compressed, and you can decompress it with Python's gzip module before further processing.

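import gzip
import urllib.request

webAddress = "https://projects.fivethirtyeight.com/2018-nba-predictions/"

# The context manager closes the response even if read() raises.
with urllib.request.urlopen(webAddress) as file:
    data = file.read()
    # .get() with a default avoids an AttributeError when the header is absent.
    contentEncoding = file.headers.get('Content-Encoding', '')

if contentEncoding.lower() == 'gzip':
    data = gzip.decompress(data)
dataString = data.decode(encoding='UTF-8')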
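As a cross-check that doesn't rely on the response headers at all: the data[:10] shown in the question starts with b'\x1f\x8b', which is the two-byte signature that opens every gzip stream. Here is a minimal sketch that sniffs for that signature before decompressing. The helper name maybe_gunzip is made up for illustration, and the sample bodies are generated locally rather than fetched from the 538 site.

```python
import gzip

# Every gzip stream begins with these two bytes (compare data[:10] in the question).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(data: bytes) -> bytes:
    """Return the decompressed bytes if data looks like gzip, else data unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


# Locally simulated response bodies (hypothetical stand-ins for file.read()).
compressed = gzip.compress("<!DOCTYPE html>".encode("utf-8"))
plain = b"\n<!DOCTYPE html>"

print(repr(compressed[:2]))
print(maybe_gunzip(compressed).decode("utf-8"))
print(maybe_gunzip(plain).decode("utf-8"))
```

Checking the content-encoding header is still the cleaner approach; sniffing the first bytes is mostly handy when debugging or when the headers are no longer available.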

OTOH, if you can use the requests module, it will handle all this mess by itself, including compression (did I mention that you may also get deflate instead of gzip, which is the same compression algorithm with a different wrapper?) and, at least partially, character encoding.

import requests
webAddress = "https://projects.fivethirtyeight.com/2018-nba-predictions/"
r = requests.get(webAddress)
print(repr(r.text))

This will perform your request and correctly print out the already-decoded Unicode string.
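If you have to stay with urllib, here is a rough sketch of handling both gzip and deflate by hand. The function name decode_body is hypothetical, and the fallback to raw deflate is an assumption about servers that omit the zlib wrapper, not something observed on the 538 site. The bodies are simulated locally so the snippet runs without a network connection.

```python
import gzip
import zlib


def decode_body(data: bytes, content_encoding: str = "") -> bytes:
    """Undo the compression named by a Content-Encoding header value."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        return gzip.decompress(data)
    if encoding == "deflate":
        try:
            # "deflate" is normally a zlib-wrapped stream.
            return zlib.decompress(data)
        except zlib.error:
            # Assumption: some servers send a raw deflate stream with no wrapper.
            return zlib.decompress(data, -zlib.MAX_WBITS)
    # No (or unrecognised) Content-Encoding: return the bytes untouched.
    return data


html = "<!DOCTYPE html><html lang=\"en\">".encode("utf-8")

print(decode_body(gzip.compress(html), "gzip").decode("utf-8"))
print(decode_body(zlib.compress(html), "deflate").decode("utf-8"))
print(decode_body(html).decode("utf-8"))
```

With a real response you would pass file.read() and file.headers.get('Content-Encoding', '') as the two arguments.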

Matteo Italia answered Jan 31 '23 22:01