
Why am I able to read a HEAD HTTP request in Python 3 urllib.request?

I want to make a HEAD request without any content data to conserve bandwidth. I'm using urllib.request. However, upon testing, it appears the HEAD request also gets the data. What's going on?

Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:16:31) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> req = urllib.request.Request("http://www.google.com", method="HEAD")
>>> resp = urllib.request.urlopen(req)
>>> a = resp.read()
>>> len(a)
24088
Eric asked Mar 29 '15 09:03



1 Answer

The http://www.google.com URL redirects:

$ curl -D - -X HEAD http://www.google.com
HTTP/1.1 302 Found
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Location: http://www.google.co.uk/?gfe_rd=cr&ei=A8sXVZLOGvHH8ge1jYKwDQ
Content-Length: 261
Date: Sun, 29 Mar 2015 09:50:59 GMT
Server: GFE/2.0
Alternate-Protocol: 80:quic,p=0.5

and urllib.request has followed the redirect, issuing a GET request to that new location:

>>> import urllib.request
>>> req = urllib.request.Request("http://www.google.com", method="HEAD")
>>> resp = urllib.request.urlopen(req)
>>> resp.url
'http://www.google.co.uk/?gfe_rd=cr&ei=ucoXVdfaJOTH8gf-voKwBw'

You'd have to build your own handler stack to prevent this; HTTPRedirectHandler isn't smart enough to leave a redirect alone when the request uses the HEAD method. Adapting Alan Duan's example from "How do I prevent Python's urllib(2) from following a redirect" to Python 3 gives you:

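import urllib.request

class NoRedirection(urllib.request.HTTPErrorProcessor):
    # Return every response as-is, so 3xx responses never reach
    # HTTPRedirectHandler (and error statuses no longer raise HTTPError).
    def http_response(self, request, response):
        return response
    https_response = http_response

# NoRedirection replaces the default HTTPErrorProcessor in the handler stack.
opener = urllib.request.build_opener(NoRedirection)

req = urllib.request.Request("http://www.google.com", method="HEAD")
resp = opener.open(req)  # the redirect response itself, not the page it points to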
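
To check this behaviour without depending on how Google happens to respond, here is a minimal sketch against a throwaway local server. The RedirectingHandler class, the /old and /new paths, the head_without_redirect helper and the ProxyHandler({}) argument are all assumptions made up for this illustration; only NoRedirection is taken from the code above.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /old redirects to /new."""

    seen = []  # (method, path) pairs, recorded so we can inspect them later

    def _respond(self, send_body):
        self.seen.append((self.command, self.path))
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if send_body:
            self.wfile.write(body)

    def do_HEAD(self):
        self._respond(send_body=False)

    def do_GET(self):
        self._respond(send_body=True)

    def log_message(self, *args):
        pass  # keep the demo quiet


class NoRedirection(urllib.request.HTTPErrorProcessor):
    # Same class as above: hand back every response untouched.
    def http_response(self, request, response):
        return response
    https_response = http_response


def head_without_redirect(url):
    # ProxyHandler({}) ignores proxy environment variables so the request
    # goes straight to the local server (an addition for this demo only).
    opener = urllib.request.build_opener(
        NoRedirection, urllib.request.ProxyHandler({}))
    req = urllib.request.Request(url, method="HEAD")
    with opener.open(req) as resp:
        return resp.status, resp.headers.get("Location"), resp.read()


# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

status, location, body = head_without_redirect(base + "/old")
print(status, location, len(body))  # prints: 302 /new 0

server.shutdown()
server.server_close()
```

The server only ever sees a single HEAD request for /old, which is the bandwidth-saving behaviour the question was after.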

You'd be better off using the requests library; requests.head() and requests.Session().head() explicitly default to allow_redirects=False, so there you can see the original response:

>>> import requests
>>> requests.head('http://www.google.com')
<Response [302]>
>>> _.headers['Location']
'http://www.google.co.uk/?gfe_rd=cr&ei=FcwXVbepMvHH8ge1jYKwDQ'

Even if redirection is enabled, the response.history list gives you access to the intermediate responses, and requests uses the correct method for the redirected call too:

>>> response = requests.head('http://www.google.com', allow_redirects=True)
>>> response.url
'http://www.google.co.uk/?gfe_rd=cr&ei=8e0XVYfGMubH8gfJnoKoDQ'
>>> response.history
[<Response [302]>]
>>> response.history[0].url
'http://www.google.com/'
>>> response.request.method
'HEAD'
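
If you have to stay with urllib.request and still want redirects followed, a rough equivalent is to subclass HTTPRedirectHandler and put the HEAD method back on the follow-up request. This is only a sketch: HeadPreservingRedirectHandler and head_following_redirects are hypothetical names, and it is only needed on Python versions where the redirected request falls back to GET as described above.

```python
import urllib.request


class HeadPreservingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler: re-issue HEAD (not GET) when a HEAD is redirected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Let the stock implementation build the follow-up request
        # (or raise HTTPError if the redirect is not allowed).
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        if new_req is not None and req.get_method() == "HEAD":
            new_req.method = "HEAD"  # Request.get_method() honours this attribute
        return new_req


def head_following_redirects(url):
    # Not called here, so nothing touches the network when this file is run.
    opener = urllib.request.build_opener(HeadPreservingRedirectHandler)
    return opener.open(urllib.request.Request(url, method="HEAD"))


# Offline check: call redirect_request() directly with placeholder arguments.
handler = HeadPreservingRedirectHandler()
original = urllib.request.Request("http://example.com/old", method="HEAD")
follow_up = handler.redirect_request(
    original, None, 302, "Found", {}, "http://example.com/new")
print(follow_up.get_method(), follow_up.full_url)  # prints: HEAD http://example.com/new
```

Because the subclass is passed to build_opener(), it takes the place of the default HTTPRedirectHandler in the handler stack.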
Martijn Pieters answered Oct 13 '22 00:10