Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Requests - get content-type/size without fetching the whole page/content

I have a simple website crawler, it works fine, but sometime it stuck because of large content such as ISO images, .exe files and other large stuff. Guessing content-type using file extension is probably not the best idea.

Is it possible to get content-type and content length/size without fetching the whole content/page?

Here is my code:

requests.adapters.DEFAULT_RETRIES = 2
url = url.decode('utf8', 'ignore')
urlData = urlparse.urlparse(url)
urlDomain = urlData.netloc
session = requests.Session()
customHeaders = {}
if maxRedirects == None:
    session.max_redirects = self.maxRedirects
else:
    session.max_redirects = maxRedirects
self.currentUserAgent = self.userAgents[random.randrange(len(self.userAgents))]
customHeaders['User-agent'] = self.currentUserAgent
try:
    response = session.get(url, timeout=self.pageOpenTimeout, headers=customHeaders)
    currentUrl = response.url
    currentUrlData = urlparse.urlparse(currentUrl)
    currentUrlDomain = currentUrlData.netloc
    domainWWW = 'www.' + str(urlDomain)
    headers = response.headers
    contentType = str(headers['content-type'])
except:
    logging.basicConfig(level=logging.DEBUG, filename=self.exceptionsFile)
    logging.exception("Get page exception:")
    response = None
like image 752
5w0rdf1sh Avatar asked May 18 '14 04:05

5w0rdf1sh


People also ask

What does response 200 mean in Python?

A status code informs you of the status of the request. For example, a 200 OK status means that your request was successful, whereas a 404 NOT FOUND status means that the resource you were looking for was not found.

What does requests get () do?

The get() method sends a GET request to the specified url.

What is timeout in requests Python?

Timeouts in Python requestsYou can tell requests library to stop waiting for a response after a given amount of time by passing a number to the timeout parameter. If the requests library does not receive response in x seconds, it will raise a Timeout error.


1 Answers

Yes.

You can use the Session.head method to create HEAD requests:

response = session.head(url, timeout=self.pageOpenTimeout, headers=customHeaders)
contentType = response.headers['content-type']

A HEAD request similar to GET request, except that the message body would not be sent.

Here is a quote from Wikipedia:

HEAD Asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.

like image 160
aIKid Avatar answered Oct 20 '22 10:10

aIKid