
Get html using Python requests?

I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab html for various websites until I tried this:

>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F') 

Instead of the basic html that is the source for this page, I get:

>>> r.text
'\x1f\ufffd\x08\x00\x00\x00\x00\x00\x00\x03\ufffd]o\u06f8\x12\ufffd\ufffd\ufffd+\ufffd]...
>>> r.content
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9d]o\xdb\xb8\x12\x86\xef\xfb+\x88]\x14h...

I have tried many combinations of get/post with every syntax I can guess from the documentation and from SO and other examples. I don't understand what I am seeing above, haven't been able to turn it into anything I can read, and can't figure out how to get what I actually want. My question is, how do I get the html for the above page?

asked Jan 06 '15 by Rich Thompson


1 Answer

The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:

$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html

The <!DOCTYPE..> line there is not a valid HTTP header. As a result, the remaining headers past Server are ignored. Why the server interjects that line is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
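The effect of that premature blank line can be sketched with the standard-library header parser. The raw bytes below are a simplified reconstruction of the response head shown above, not a live capture; parsing stops at the first blank line, so everything after it is invisible:

```python
from io import BytesIO
from http.client import parse_headers

# Simplified reconstruction: the CGI output injects a blank line mid-headers,
# so the parser stops there and never sees the later headers.
raw = (b"Date: Tue, 06 Jan 2015 17:46:49 GMT\r\n"
       b"Server: Apache\r\n"
       b"\r\n"                        # premature end of headers
       b"Content-Encoding: gzip\r\n"
       b"Content-Length: 3659\r\n")
parsed = parse_headers(BytesIO(raw))
print("Content-Encoding" in parsed)   # False: the encoding header is lost
```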

As such, requests also doesn't detect that the data is gzip-encoded. The data is all there; you just have to decode it yourself. Or you could, if the response weren't also rather incomplete.
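Decoding by hand is plain gzip decompression. A minimal sketch, round-tripping a sample payload since the live response may be truncated; with the real response you would pass r.content to gzip.decompress (which raises EOFError on truncated data):

```python
import gzip

html = b"<html><body>rainfall data</body></html>"
payload = gzip.compress(html)        # stands in for the raw r.content bytes
decoded = gzip.decompress(payload)   # what you would do to the real response
print(decoded == html)               # True
```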

The work-around is to tell the server not to bother with compression:

headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)

and an uncompressed response is returned.

Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:

>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
 'connection': 'Keep-Alive',
 'content-encoding': 'gzip',
 'content-length': '3659',
 'content-type': 'text/html',
 'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'vary': 'Accept-Encoding'}

and the content-encoding information survives, so there, requests decodes the content for you as expected.

answered Oct 10 '22 by Martijn Pieters