 

Scrapy handle 302 response code

I am using a simple CrawlSpider implementation to crawl websites. By default, Scrapy follows 302 redirects to their target locations and more or less ignores the originally requested link. On a particular site I encountered a page that 302-redirects to another page. What I want is to log both the original link (which responds with 302) and the target location (specified in the HTTP response header) and process them in the parse_item method of the CrawlSpider. How can I achieve this?

I came across solutions that suggest using dont_redirect=True or REDIRECT_ENABLED=False, but I do not actually want to ignore the redirects; in fact, I want to consider (i.e. not ignore) the redirecting page as well.

E.g.: I visit http://www.example.com/page1, which sends a 302 redirect HTTP response pointing to http://www.example.com/page2. By default, Scrapy ignores page1, follows to page2 and processes it. I want to process both page1 and page2 in parse_item.

EDIT: I am already using handle_httpstatus_list = [500, 404] in the spider's class definition to handle 500 and 404 response codes in parse_item, but the same does not work for 302 if I specify it in handle_httpstatus_list.
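
For reference, a minimal sketch of the setup described above (the link-extraction rule and the yielded item are simplified placeholders):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):

    name = "myspider"
    start_urls = ['http://www.example.com/']
    # 500 and 404 responses reach parse_item as expected; adding 302 here
    # alone does not stop the redirect middleware from swallowing redirects.
    handle_httpstatus_list = [500, 404, 302]

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'status': response.status}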

asked Dec 24 '22 by bawejakunal


2 Answers

Scrapy 1.0.5 (the latest official release as I write these lines) does not use handle_httpstatus_list in the built-in RedirectMiddleware -- see this issue. The fix lands in Scrapy 1.1.0 (1.1.0rc1 is already available).

Even if you disable redirects, you can still mimic their behavior in your callback: check the Location header and return a Request to the redirect target.

Example spider:

$ cat redirecttest.py
import scrapy


class RedirectTest(scrapy.Spider):

    name = "redirecttest"
    start_urls = [
        'http://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip'
    ]
    handle_httpstatus_list = [302]

    def start_requests(self):
        # Override start_requests() so every start URL uses parse_page.
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, callback=self.parse_page)

    def parse_page(self, response):
        self.logger.debug("(parse_page) response: status=%d, URL=%s" % (response.status, response.url))
        if response.status in (302,) and 'Location' in response.headers:
            # Redirects are disabled, so follow the Location header manually.
            self.logger.debug("(parse_page) Location header: %r" % response.headers['Location'])
            yield scrapy.Request(
                response.urljoin(response.headers['Location']),
                callback=self.parse_page)

Console log:

$ scrapy runspider redirecttest.py -s REDIRECT_ENABLED=0
[scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
[scrapy] INFO: Optional features available: ssl, http11
[scrapy] INFO: Overridden settings: {'REDIRECT_ENABLED': '0'}
[scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
[scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
[scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[scrapy] INFO: Enabled item pipelines: 
[scrapy] INFO: Spider opened
[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/get> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/get
[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=302, URL=https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip
[redirecttest] DEBUG: (parse_page) Location header: 'http://httpbin.org/ip'
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/ip
[scrapy] INFO: Closing spider (finished)

Note that you'll need handle_httpstatus_list with 302 in it; otherwise, you'll see this kind of log (coming from HttpErrorMiddleware):

[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[scrapy] DEBUG: Ignoring response <302 https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip>: HTTP status code is not handled or not allowed
answered Jan 10 '23 by paul trmbrth


The redirect middleware "catches" the response before it reaches your HttpError middleware and launches a new request with the redirect URL. The original response is never returned, i.e. you don't even "see" the 302 codes, because they never reach HttpErrorMiddleware. That is why having 302 in handle_httpstatus_list has no effect.

Have a look at the source in scrapy.downloadermiddlewares.redirect.RedirectMiddleware: in process_response() you can see what happens. It launches a new request with the original URL replaced by redirected_url. There is no "return response", so the original response simply gets discarded.

Basically, you just need to override process_response() so that it not only sends another request to redirected_url but also returns the original response to the spider, as in the sketch below.
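
Since process_response() may return only a single object (a Response or a Request, not both), one way to get both behaviors is to subclass RedirectMiddleware, inject the redirect request through the engine by hand, and hand the original response back to the spider. A hedged sketch of that idea follows; the class name is mine, and engine.crawl() is an internal API whose signature has varied across Scrapy versions, so treat this as a sketch rather than a drop-in solution:

from six.moves.urllib.parse import urljoin

from scrapy.downloadermiddlewares.redirect import RedirectMiddleware


class KeepOriginalRedirectMiddleware(RedirectMiddleware):
    """Hand 3xx responses to the spider while still crawling the target."""

    @classmethod
    def from_crawler(cls, crawler):
        mw = super(KeepOriginalRedirectMiddleware, cls).from_crawler(crawler)
        mw.crawler = crawler  # keep a handle on the crawler to reach the engine
        return mw

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307) and 'Location' in response.headers:
            # Build the redirect request ourselves (header values are bytes on Python 3).
            location = response.headers['Location'].decode('latin1')
            redirected = request.replace(url=urljoin(request.url, location),
                                         dont_filter=True)
            # Inject it via the engine (internal, version-dependent API) ...
            self.crawler.engine.crawl(redirected, spider)
            # ... and return the original 3xx response to the spider callback.
            return response
        return super(KeepOriginalRedirectMiddleware, self).process_response(
            request, response, spider)

To use it, swap it in for the stock middleware in settings.py (DOWNLOADER_MIDDLEWARES = {'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None, 'myproject.middlewares.KeepOriginalRedirectMiddleware': 600}) and keep 302 in the spider's handle_httpstatus_list so that HttpErrorMiddleware lets the original response through.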

In parse_item you will probably want some conditional logic, depending on whether the response is a redirect or not. I suppose the redirecting page will not look exactly the same, so your item may end up looking quite different too. Another option is to use a different parser for each kind of response (depending on whether the original or the redirected URL points at a "special page"): all you need is a separate parse function, e.g. parse_redirected_urls(), in your spider, set as the callback of the redirect request, as in the sketch below.
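
A hedged sketch of that two-callback idea (the names are mine, and it assumes Scrapy 1.1+, where RedirectMiddleware honors the spider's handle_httpstatus_list -- on 1.0.x run it with REDIRECT_ENABLED=0 instead):

import scrapy


class SplitCallbackSpider(scrapy.Spider):

    name = "splitcallbacks"
    start_urls = ['http://www.example.com/page1']
    handle_httpstatus_list = [302]  # let 302 responses reach the callbacks

    def parse(self, response):
        # Every original URL lands here, redirecting or not.
        yield {'url': response.url, 'status': response.status}
        if response.status == 302 and 'Location' in response.headers:
            target = response.urljoin(response.headers['Location'].decode('latin1'))
            # Pages reached via redirect get their own callback below.
            yield scrapy.Request(target, callback=self.parse_redirected_urls)

    def parse_redirected_urls(self, response):
        # Parsing logic specific to redirect targets goes here.
        yield {'url': response.url, 'status': response.status, 'redirected': True}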

answered Jan 10 '23 by Ruehri