 

Capturing HTTP status codes with a Scrapy spider

I am new to Scrapy. I am writing a spider to check a long list of URLs for their server status codes and, where appropriate, the URLs they are redirected to. Importantly, if there is a chain of redirects, I need to know the status code and URL at each hop. I am using response.meta['redirect_urls'] to capture the URLs, but am unsure how to capture the status codes - there doesn't seem to be a corresponding response meta key.

I realise I may need to write some custom middleware to expose these values, but am not quite clear how to log the status codes for every hop, nor how to access those values from the spider. I've had a look but can't find an example of anyone doing this. If anyone can point me in the right direction it would be much appreciated.

For example,

    items = []
    item = RedirectItem()
    item['url'] = response.url
    item['redirected_urls'] = response.meta['redirect_urls']     
    item['status_codes'] = #????
    items.append(item)

Edit - Based on feedback from warawauk and some really proactive help from the guys on the IRC channel (freenode #scrapy), I've managed to do this. I believe it's a little hacky, so any comments for improvement are welcome:

(1) Disable the default middleware in the settings, and add your own:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
        'myproject.middlewares.CustomRedirectMiddleware': 100,
    }
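(Side note: the module path above matches the Scrapy version current at the time, 0.x. On newer releases the redirect middleware was moved, so - assuming Scrapy 1.0 or later - the equivalent override would look like this instead:)

```python
# Assumption: Scrapy 1.0+ moved the redirect middleware out of scrapy.contrib,
# so the same override references the new module path.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'myproject.middlewares.CustomRedirectMiddleware': 100,
}
```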

(2) Create your CustomRedirectMiddleware in your middlewares.py. It inherits from the built-in RedirectMiddleware class and records the status code of each hop:

    from urlparse import urljoin

    from scrapy.contrib.downloadermiddleware.redirect import RedirectMiddleware
    from scrapy.http import HtmlResponse
    from scrapy.utils.response import get_meta_refresh


    class CustomRedirectMiddleware(RedirectMiddleware):
        """Handle redirection of requests based on response status and meta-refresh html tag"""

        def process_response(self, request, response, spider):
            # Record this hop's status code. request.replace() copies meta,
            # so the same list follows the request across the redirect chain
            # (the final non-redirect status is recorded too).
            request.meta.setdefault('redirect_status', []).append(response.status)
            if 'dont_redirect' in request.meta:
                return response

            if request.method.upper() == 'HEAD':
                if response.status in [301, 302, 303, 307] and 'Location' in response.headers:
                    redirected_url = urljoin(request.url, response.headers['location'])
                    redirected = request.replace(url=redirected_url)
                    return self._redirect(redirected, request, spider, response.status)
                else:
                    return response

            if response.status in [302, 303] and 'Location' in response.headers:
                redirected_url = urljoin(request.url, response.headers['location'])
                redirected = self._redirect_request_using_get(request, redirected_url)
                return self._redirect(redirected, request, spider, response.status)

            if response.status in [301, 307] and 'Location' in response.headers:
                redirected_url = urljoin(request.url, response.headers['location'])
                redirected = request.replace(url=redirected_url)
                return self._redirect(redirected, request, spider, response.status)

            if isinstance(response, HtmlResponse):
                interval, url = get_meta_refresh(response)
                if url and interval < self.max_metarefresh_delay:
                    redirected = self._redirect_request_using_get(request, url)
                    return self._redirect(redirected, request, spider, 'meta refresh')

            return response

(3) You can now access the list of status codes in your spider callback with

    response.meta['redirect_status']

(response.meta is a shortcut to response.request.meta, so the list recorded by the middleware is available on the final response).
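The accumulation works because dict.setdefault returns the existing list once it has been created, and the request's meta dict is carried along the redirect chain. A minimal stand-in sketch (no Scrapy required) of how the list builds up:

```python
# Stand-in sketch of the middleware's setdefault(...).append(...) call.
# In Scrapy, request.replace() copies meta, so this same list object
# travels with the request across every redirect hop.
meta = {}

for status in (301, 302, 200):  # two redirects, then the final page
    meta.setdefault('redirect_status', []).append(status)

print(meta['redirect_status'])  # [301, 302, 200]
```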
reportingmonkey asked Jun 11 '12 14:06


1 Answer

I believe that's available as

    response.status

See http://doc.scrapy.org/en/0.14/topics/request-response.html#scrapy.http.Response
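That attribute gives the final hop's status; combined with the meta keys from the question's middleware, each hop's URL can be paired with its status code. A sketch with hypothetical values (example data only, shaped like what the middleware records):

```python
# Hypothetical values, shaped like response.meta / response.url / response.status
# after the custom middleware has recorded each hop.
redirect_urls = ['http://example.com/a', 'http://example.com/b']  # intermediate hops
redirect_status = [301, 302, 200]  # one status per hop, plus the final response
final_url = 'http://example.com/final'

# Pair every visited URL with the status code it returned.
hops = list(zip(redirect_urls + [final_url], redirect_status))
print(hops)
# [('http://example.com/a', 301), ('http://example.com/b', 302),
#  ('http://example.com/final', 200)]
```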

lindelof answered Sep 19 '22 23:09