Howdy folks,
I'm new to getting data from the web using python. I'd like to have the source code of this page in a string: https://projects.fivethirtyeight.com/2018-nba-predictions/
The following code has worked for other pages (such as https://www.basketball-reference.com/boxscores/201712090ATL.html):
import urllib.request
webAddress = "https://projects.fivethirtyeight.com/2018-nba-predictions/"
file = urllib.request.urlopen(webAddress)
data = file.read()
file.close()
dataString = data.decode(encoding='UTF-8')
And I'd expect dataString to be a string of HTML (see below for my expectations in this specific case)
<!DOCTYPE html><html lang="en"><head><meta property="article:modified_time" etc etc
Instead, for the 538 website, I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
My research has suggested that the problem is that my file isn't actually encoded using UTF-8, but both the page's charset and Beautiful Soup's UnicodeDammit() claim it's UTF-8 (the second might be because of the first). chardet.detect() doesn't suggest any encoding. I've tried substituting the following for 'UTF-8' in the encoding parameter of decode(), to no avail:
ISO-8859-1
latin-1
Windows-1252
Perhaps worth mentioning is that the byte array data doesn't look like I'd expect it to. Here's data[:10] from a working URL:
b'\n<!DOCTYPE'
Here's data[:10] from the 538 site:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
What's up?
The server provided you with gzip-compressed data; this is not very common, since urllib by default doesn't send any Accept-Encoding header, so servers generally, conservatively, don't compress the data. Still, the Content-Encoding field of the response is set, so you have a way to know that your page is indeed gzip-compressed, and you can decompress it using Python's gzip module before further processing.
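Incidentally, the \x1f\x8b prefix you saw in data[:10] is the gzip magic number, so you can also sniff the body itself if a server ever omits the header. A minimal sketch (looks_gzipped is my own helper name, not a stdlib function):

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # every gzip stream starts with these two bytes

def looks_gzipped(data: bytes) -> bool:
    # Heuristic check on the raw body, useful when headers are unreliable.
    return data[:2] == GZIP_MAGIC

# The prefix from the 538 response matches:
sample = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
print(looks_gzipped(sample))  # True
```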
import urllib.request
import gzip

webAddress = "https://projects.fivethirtyeight.com/2018-nba-predictions/"
file = urllib.request.urlopen(webAddress)
data = file.read()
file.close()
# Only decompress if the server says the body is gzip-compressed
if file.headers.get('Content-Encoding', '').lower() == 'gzip':
    data = gzip.decompress(data)
dataString = data.decode(encoding='UTF-8')
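If you want the urllib route to keep working when a server answers with deflate as well, a small helper can dispatch on the Content-Encoding value. This is just a sketch, and decode_body is my own name, not part of urllib:

```python
import gzip
import zlib

def decode_body(data, content_encoding):
    """Decompress an HTTP body according to its Content-Encoding header."""
    if not content_encoding:
        return data
    encoding = content_encoding.lower()
    if encoding == 'gzip':
        return gzip.decompress(data)
    if encoding == 'deflate':
        try:
            # Most servers send a zlib-wrapped stream for "deflate"...
            return zlib.decompress(data)
        except zlib.error:
            # ...but some send a raw deflate stream with no zlib header.
            return zlib.decompress(data, -zlib.MAX_WBITS)
    return data
```

You would call it as decode_body(data, file.headers.get('Content-Encoding')) before the .decode('UTF-8') step.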
OTOH, if you have the possibility to use the requests module, it will handle all this mess by itself, including compression (did I mention that you may also get deflate besides gzip, which is the same but with different headers?) and, at least partially, encoding.
import requests
webAddress = "https://projects.fivethirtyeight.com/2018-nba-predictions/"
r = requests.get(webAddress)
print(repr(r.text))
This will perform your request and correctly print out the already-decoded Unicode string.