I have a simple website crawler, it works fine, but sometime it stuck because of large content such as ISO images, .exe files and other large stuff. Guessing content-type using file extension is probably not the best idea.
Is it possible to get content-type and content length/size without fetching the whole content/page?
Here is my code:
requests.adapters.DEFAULT_RETRIES = 2
url = url.decode('utf8', 'ignore')
urlData = urlparse.urlparse(url)
urlDomain = urlData.netloc
session = requests.Session()
customHeaders = {}
if maxRedirects == None:
session.max_redirects = self.maxRedirects
else:
session.max_redirects = maxRedirects
self.currentUserAgent = self.userAgents[random.randrange(len(self.userAgents))]
customHeaders['User-agent'] = self.currentUserAgent
try:
response = session.get(url, timeout=self.pageOpenTimeout, headers=customHeaders)
currentUrl = response.url
currentUrlData = urlparse.urlparse(currentUrl)
currentUrlDomain = currentUrlData.netloc
domainWWW = 'www.' + str(urlDomain)
headers = response.headers
contentType = str(headers['content-type'])
except:
logging.basicConfig(level=logging.DEBUG, filename=self.exceptionsFile)
logging.exception("Get page exception:")
response = None
A status code informs you of the status of the request. For example, a 200 OK status means that your request was successful, whereas a 404 NOT FOUND status means that the resource you were looking for was not found.
The get() method sends a GET request to the specified url.
Timeouts in Python requestsYou can tell requests library to stop waiting for a response after a given amount of time by passing a number to the timeout parameter. If the requests library does not receive response in x seconds, it will raise a Timeout error.
Yes.
You can use the Session.head
method to create HEAD
requests:
response = session.head(url, timeout=self.pageOpenTimeout, headers=customHeaders)
contentType = response.headers['content-type']
A HEAD
request similar to GET
request, except that the message body would not be sent.
Here is a quote from Wikipedia:
HEAD Asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With