Requests - get content-type/size without fetching the whole page/content

I have a simple website crawler. It works fine, but sometimes it gets stuck on large content such as ISO images, .exe files and other large downloads. Guessing the content type from the file extension is probably not the best idea.

Is it possible to get the content type and content length/size without fetching the whole page or file?

Here is my code:

# Fragment of a crawler method (Python 2): self, url and maxRedirects
# come from the enclosing method.
import logging
import random
import urlparse

import requests

requests.adapters.DEFAULT_RETRIES = 2
url = url.decode('utf8', 'ignore')
urlData = urlparse.urlparse(url)
urlDomain = urlData.netloc
session = requests.Session()
customHeaders = {}
# Fall back to the crawler-wide redirect limit when none is passed in.
if maxRedirects is None:
    session.max_redirects = self.maxRedirects
else:
    session.max_redirects = maxRedirects
# Pick a random user agent for this request.
self.currentUserAgent = random.choice(self.userAgents)
customHeaders['User-agent'] = self.currentUserAgent
try:
    # A plain GET downloads the whole body, which is what stalls on large files.
    response = session.get(url, timeout=self.pageOpenTimeout, headers=customHeaders)
    currentUrl = response.url
    currentUrlData = urlparse.urlparse(currentUrl)
    currentUrlDomain = currentUrlData.netloc
    domainWWW = 'www.' + str(urlDomain)
    headers = response.headers
    contentType = str(headers['content-type'])
except Exception:
    logging.basicConfig(level=logging.DEBUG, filename=self.exceptionsFile)
    logging.exception("Get page exception:")
    response = None


asked May 18 '14 04:05

5w0rdf1sh

1 Answer

Yes.

You can use the Session.head method to send a HEAD request:

response = session.head(url, timeout=self.pageOpenTimeout, headers=customHeaders)
contentType = response.headers['content-type']
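To also read the size, the same response carries a Content-Length header when the server chooses to send one. The sketch below is only an illustration: `parse_content_headers` and `probe` are hypothetical helper names, and `session` is assumed to be a `requests.Session` (or anything with a compatible `head` method). Note that `Session.head` does not follow redirects unless `allow_redirects=True` is passed, and that either header may be missing.

```python
def parse_content_headers(headers):
    """Return (mime_type, size_in_bytes_or_None) from a header mapping.

    requests exposes headers as a case-insensitive mapping; a plain dict
    passed in here must use lower-case keys.
    """
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    raw_type = headers.get('content-type', '')
    mime_type = raw_type.split(';', 1)[0].strip().lower()

    # Content-Length is optional and may be malformed, so fall back to None.
    raw_length = headers.get('content-length')
    try:
        size = int(raw_length) if raw_length is not None else None
    except ValueError:
        size = None
    return mime_type, size


def probe(session, url, timeout=10):
    """Send a HEAD request and return (mime_type, size_or_None)."""
    # Unlike Session.get, Session.head leaves allow_redirects off by default.
    response = session.head(url, timeout=timeout, allow_redirects=True)
    return parse_content_headers(response.headers)


# Usage (assumes the requests package is installed):
#   import requests
#   mime_type, size = probe(requests.Session(), 'http://example.com/')
```

A crawler could then skip the URL when `mime_type` is not something it wants, or when `size` is known and exceeds its limit.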

A HEAD request is similar to a GET request, except that the server does not send the message body.

Here is a quote from Wikipedia:

HEAD Asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.
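Some servers reject HEAD or answer it with different headers than they would send for GET. A possible fallback, sketched here rather than taken from the answer, is a GET with `stream=True`: requests then returns once the headers arrive and only downloads the body when it is read. `fetch_if_small`, `max_bytes` and `allowed_prefixes` are hypothetical names chosen for this example, and `session` is assumed to be a `requests.Session`.

```python
def fetch_if_small(session, url, max_bytes=1000000,
                   allowed_prefixes=('text/',), timeout=10):
    """Return the body as bytes, or None if the type or size is unwanted."""
    # stream=True defers the body download until the content is iterated.
    response = session.get(url, stream=True, timeout=timeout)
    try:
        raw_type = response.headers.get('content-type', '')
        mime_type = raw_type.split(';', 1)[0].strip().lower()
        if not mime_type.startswith(allowed_prefixes):
            return None

        # Trust Content-Length only as an early hint; it may be absent.
        declared = response.headers.get('content-length')
        if declared is not None and declared.isdigit() and int(declared) > max_bytes:
            return None

        # Enforce the cap while reading, in case the header was missing or wrong.
        chunks = []
        received = 0
        for chunk in response.iter_content(chunk_size=8192):
            received += len(chunk)
            if received > max_bytes:
                return None
            chunks.append(chunk)
        return b''.join(chunks)
    finally:
        # Closing releases the connection even when the body was not read.
        response.close()


# Usage (assumes the requests package is installed):
#   import requests
#   body = fetch_if_small(requests.Session(), 'http://example.com/')
```

The cap inside the loop matters because a response without Content-Length would otherwise be read to the end.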


answered Oct 20 '22 10:10

aIKid

Tags: python, http-headers, content-type, python-requests
