I am new to scrapy. I am writing a spider designed to check a long list of URLs for their server status codes and, where appropriate, the URLs they are redirected to. Importantly, if there is a chain of redirects, I need to know the status code and URL at each hop. I am using response.meta['redirect_urls'] to capture the URLs, but am unsure how to capture the status codes - there doesn't seem to be a response meta key for them.
I realise I may need to write some custom middleware to expose these values, but I'm not quite clear how to log the status codes for every hop, nor how to access these values from the spider. I've had a look but can't find an example of anyone doing this. If anyone can point me in the right direction it would be much appreciated.
For example,
items = []
item = RedirectItem()
item['url'] = response.url
item['redirected_urls'] = response.meta['redirect_urls']
item['status_codes'] = #????
items.append(item)
Edit - Based on feedback from warawauk and some really proactive help from the guys on the IRC channel (freenode #scrapy), I've managed to do this. I believe it's a little hacky, so any comments for improvement are welcome:
(1) Disable the default middleware in the settings, and add your own:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'myproject.middlewares.CustomRedirectMiddleware': 100,
}
(2) Create your CustomRedirectMiddleware in your middlewares.py. It inherits from the stock RedirectMiddleware class and captures the redirect statuses:
from urlparse import urljoin

from scrapy.contrib.downloadermiddleware.redirect import RedirectMiddleware
from scrapy.http import HtmlResponse
from scrapy.utils.response import get_meta_refresh


class CustomRedirectMiddleware(RedirectMiddleware):
    """Handle redirection of requests based on response status and meta-refresh html tag"""

    def process_response(self, request, response, spider):
        # Record the status code of every hop (note: this also records
        # the status of the final, non-redirect response)
        request.meta.setdefault('redirect_status', []).append(response.status)
        if 'dont_redirect' in request.meta:
            return response
        if request.method.upper() == 'HEAD':
            if response.status in [301, 302, 303, 307] and 'Location' in response.headers:
                redirected_url = urljoin(request.url, response.headers['location'])
                redirected = request.replace(url=redirected_url)
                return self._redirect(redirected, request, spider, response.status)
            else:
                return response
        if response.status in [302, 303] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = self._redirect_request_using_get(request, redirected_url)
            return self._redirect(redirected, request, spider, response.status)
        if response.status in [301, 307] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)
        if isinstance(response, HtmlResponse):
            interval, url = get_meta_refresh(response)
            if url and interval < self.max_metarefresh_delay:
                redirected = self._redirect_request_using_get(request, url)
                return self._redirect(redirected, request, spider, 'meta refresh')
        return response
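The key line is the setdefault/append at the top of process_response: because Scrapy copies request.meta onto each redirected request, the same list survives across hops and collects one status per response. A minimal sketch of that accumulation pattern, using a plain dict in place of request.meta:

```python
def record_status(meta, status):
    """Mimic the middleware: append each response's status to a shared list."""
    meta.setdefault('redirect_status', []).append(status)
    return meta

# Simulate a chain of two redirects followed by the final page
meta = {}
for status in (301, 302, 200):
    record_status(meta, status)

print(meta['redirect_status'])  # [301, 302, 200]
```

Because the final (non-redirect) response is also appended, the list ends with the status of the page you ultimately land on.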
(3) You can now access the list of status codes in your spider callback with
response.meta['redirect_status']
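To fill in the item from the original question, the URLs in 'redirect_urls' can be paired with the statuses in 'redirect_status'. The helper below is hypothetical (not part of Scrapy); it assumes, per the middleware above, that 'redirect_status' holds one more entry than 'redirect_urls' because it also records the final response's status:

```python
def redirect_chain(meta, final_url):
    """Pair each URL in the redirect chain with the status it returned."""
    urls = meta.get('redirect_urls', []) + [final_url]
    statuses = meta.get('redirect_status', [])
    return list(zip(urls, statuses))

# Example meta as it might look after two redirects:
meta = {
    'redirect_urls': ['http://example.com/a', 'http://example.com/b'],
    'redirect_status': [301, 302, 200],
}
print(redirect_chain(meta, 'http://example.com/c'))
# [('http://example.com/a', 301), ('http://example.com/b', 302), ('http://example.com/c', 200)]
```

In a callback you would call it as redirect_chain(response.meta, response.url) and store the result on the item.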
I believe that's available as response.status. See http://doc.scrapy.org/en/0.14/topics/request-response.html#scrapy.http.Response