Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how can I get complete header info from urlib2 request?

I am using the python urllib2 library for opening URL, and what I want is to get the complete header info of the request. When I use response.info I only get this:

Date: Mon, 15 Aug 2011 12:00:42 GMT
Server: Apache/2.2.0 (Unix)
Last-Modified: Tue, 01 May 2001 18:40:33 GMT
ETag: "13ef600-141-897e4a40"
Accept-Ranges: bytes
Content-Length: 321
Connection: close
Content-Type: text/html

I am expecting the complete info as given by live_http_headers (add-on for firefox), e.g:

http://www.yellowpages.com.mt/Malta-Web/127151.aspx

GET /Malta-Web/127151.aspx HTTP/1.1
Host: www.yellowpages.com.mt
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: __utma=156587571.1883941323.1313405289.1313405289.1313405289.1;    __utmz=156587571.1313405289.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)

HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 141
Date: Mon, 15 Aug 2011 12:17:25 GMT
Location: http://www.trucks.com.mt
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET, UrlRewriter.NET 2.0.0
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=zhnqh5554omyti55dxbvmf55; path=/; HttpOnly
Cache-Control: private

My request function is:

def dorequest(url, post=None, headers={}):
    cOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
    urllib2.install_opener( cOpener )
    if post:
        post = urllib.urlencode(post)
    req = urllib2.Request(url, post, headers)
    response   = cOpener.open(req)
    print response.info()  // this does not give complete header info, how can i get complete header info??
    return response.read()
 url = 'http://www.yellowpages.com.mt/Malta-Web/127151.aspx'
 html = dorequest(url)

Is it possible to achieve the desired header info details by using urllib2? I don't want to switch to httplib.

like image 492
Aamir Rind Avatar asked Aug 15 '11 12:08

Aamir Rind


People also ask

How do I get headers in Urllib?

Use the response.info() method to get the headers. This function returns a file-like object with two additional methods: geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed.

How do I get the HTTP header in Python?

In order to pass HTTP headers into a POST request using the Python requests library, you can use the headers= parameter in the . post() function. The headers= parameter accepts a Python dictionary of key-value pairs, where the key represents the header type and the value is the header value.

What does Urlopen return?

The data returned by urlopen() or urlretrieve() is the raw data returned by the server. This may be binary data (such as an image), plain text or (for example) HTML. The HTTP protocol provides type information in the reply header, which can be inspected by looking at the Content-Type header.


1 Answers

Those are all of the headers the server is sending when you do the request with urllib2.

Firefox is showing you the headers it's sending to the server as well.

When the server gets those headers from Firefox, some of them may trigger it to send back additional headers, so you end up with more response headers as well.

Duplicate the exact headers Firefox sends, and you'll get back an identical response.

Edit: That location header is sent by the page that does the redirect, not the page you're redirected to. Just use response.url to get the location of the page you've been sent to.

That first URL uses a 302 redirect. If you don't want to follow the redirect, but see the headers from the first page instead, use a URLOpener instead of a FancyURLOpener, which automatically follows redirects.

like image 181
agf Avatar answered Sep 30 '22 21:09

agf