Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Get HTTP headers from urllib2.urlopen call?

Does urllib2 fetch the whole page when a urlopen call is made?

I'd like to just read the HTTP response header without getting the page. It looks like urllib2 opens the HTTP connection and then subsequently gets the actual HTML page... or does it just start buffering the page with the urlopen call?

import urllib2 myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/' page = urllib2.urlopen(myurl) // open connection, get headers  html = page.readlines()  // stream page 
like image 406
shigeta Avatar asked May 09 '09 14:05

shigeta


People also ask

What does Urllib Urlopen return?

The problem here is that urlopen returns a reference to a file object from which you should retrieve HTML. Please note that urllib. urlopen function is marked as deprecated since python 2.6. It's recommended to use urllib2.

What is Urllib request Urlopen Python?

Urllib package is the URL handling module for python. It is used to fetch URLs (Uniform Resource Locators). It uses the urlopen function and is able to fetch URLs using a variety of different protocols. Urllib is a package that collects several modules for working with URLs, such as: urllib.


2 Answers

Use the response.info() method to get the headers.

From the urllib2 docs:

urllib2.urlopen(url[, data][, timeout])

...

This function returns a file-like object with two additional methods:

  • geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
  • info() — return the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers)

So, for your example, try stepping through the result of response.info().headers for what you're looking for.

Note the major caveat to using httplib.HTTPMessage is documented in python issue 4773.

like image 69
tolmeda Avatar answered Oct 10 '22 23:10

tolmeda


What about sending a HEAD request instead of a normal GET request. The following snipped (copied from a similar question) does exactly that.

>>> import httplib >>> conn = httplib.HTTPConnection("www.google.com") >>> conn.request("HEAD", "/index.html") >>> res = conn.getresponse() >>> print res.status, res.reason 200 OK >>> print res.getheaders() [('content-length', '0'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Sat, 20 Sep 2008 06:43:36 GMT'), ('content-type', 'text/html; charset=ISO-8859-1')] 
like image 37
reto Avatar answered Oct 10 '22 23:10

reto