Get html using Python requests?

I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab html for various websites until I tried this:

>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')

Instead of the basic html that is the source for this page, I get:

>>> r.text
'\x1f\ufffd\x08\x00\x00\x00\x00\x00\x00\x03\ufffd]o\u06f8\x12\ufffd\ufffd\ufffd+\ufffd]...
>>> r.content
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9d]o\xdb\xb8\x12\x86\xef\xfb+\x88]\x14h...

I have tried many combinations of get/post with every syntax I can guess from the documentation and from SO and other examples. I don't understand what I am seeing above, haven't been able to turn it into anything I can read, and can't figure out how to get what I actually want. My question is, how do I get the html for the above page?

940

asked Jan 06 '15 17:01

Rich Thompson

1 Answer

The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:

$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html

The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that line is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.

As such, requests also doesn't detect that the data is gzip-encoded. The data is all there; you just have to decode it. Or you could, if it weren't rather incomplete.
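For reference, the bytes in r.content start with \x1f\x8b, which is the gzip magic number. Below is a minimal sketch of decoding such a body by hand with the standard library. The helper names looks_gzipped and gunzip_lenient are made up for this illustration, and because the response here appears to be incomplete, the sketch only returns whatever could be recovered.

```python
import zlib

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def looks_gzipped(data: bytes) -> bool:
    """Return True if the data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def gunzip_lenient(data: bytes) -> bytes:
    """Decompress gzip data, tolerating a truncated stream.

    Hypothetical helper: returns the input unchanged when it does not
    look gzipped, and returns only the recoverable prefix when the
    stream is cut short. Corrupt data still raises zlib.error.
    """
    if not looks_gzipped(data):
        return data
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decompressor.decompress(data)


# Possible usage with the response from the question (the encoding is an assumption):
# html = gunzip_lenient(r.content).decode('utf-8', errors='replace')
```

A streaming decompressor is used instead of gzip.decompress because the latter raises an error when the stream ends early.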

The work-around is to tell the server not to bother with compression:

headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)

and an uncompressed response is returned.
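The same work-around can be wrapped in a small helper. This is only a sketch: fetch_html, the get parameter and the 10-second timeout are assumptions made for illustration, not part of requests itself. The get parameter exists so the function can be tried with a stub instead of a real network call.

```python
def fetch_html(url, get=None, timeout=10):
    """Fetch a page while asking the server not to compress the body."""
    if get is None:
        # requests is third-party; imported lazily so a stub can be used instead.
        import requests
        get = requests.get
    # 'identity' means "send the body as-is, without any content encoding".
    headers = {'Accept-Encoding': 'identity'}
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    return response.text


# Possible usage with the URL from the question:
# html = fetch_html('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
```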

Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:

>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
 'connection': 'Keep-Alive',
 'content-encoding': 'gzip',
 'content-length': '3659',
 'content-type': 'text/html',
 'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'vary': 'Accept-Encoding'}

and the content-encoding information survives, so there requests decodes the content for you, as expected.

118

answered Oct 10 '22 10:10

Martijn Pieters

Tags: python, html, python-requests
