Callback for redirected requests in Scrapy

I am trying to scrape using the Scrapy framework. Some requests are redirected, but the callback function set in start_requests is not called for these redirected requests; it works fine for the non-redirected ones.

I have the following code in the start_requests function:

for user in users:
    yield scrapy.Request(url=userBaseUrl + str(user['userId']), cookies=cookies,
                         headers=headers, dont_filter=True, callback=self.parse_p)

But self.parse_p is called only for the non-302 requests.

asked Sep 04 '25 by a'-

1 Answer

I guess you get a callback for the final page (after the redirect). Redirects are taken care of by the RedirectMiddleware. You could disable it, but then you would have to handle all the redirects manually. If you want to selectively disable redirects for a few types of Requests, you can do it like this:

request = scrapy.Request(url, meta={'dont_redirect': True}, callback=self.manual_handle_of_redirects)
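
If instead you want to disable redirects for the whole spider, a minimal sketch (the spider name and URL are placeholders, not from the original question) is to switch the middleware off with the REDIRECT_ENABLED setting:

import scrapy

class NoRedirectSpider(scrapy.Spider):
    name = "noredirect"  # hypothetical name, for illustration only
    start_urls = ("http://example.com/",)

    # Turning REDIRECT_ENABLED off disables RedirectMiddleware for this spider;
    # 3xx responses then reach your callbacks, provided HttpErrorMiddleware
    # lets them through (hence handle_httpstatus_list below).
    custom_settings = {"REDIRECT_ENABLED": False}
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        print("got %s with status %d" % (response.url, response.status))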

I'm not sure the intermediate Requests/Responses are very interesting, though. That's also what RedirectMiddleware assumes. As a result, it follows redirects automatically and saves the intermediate URLs (the only interesting thing) in:

response.request.meta.get('redirect_urls')

You have a few options!

Example spider:

import scrapy

class DimSpider(scrapy.Spider):
    name = "dim"

    start_urls = (
        'http://example.com/',
    )

    def parse(self, response):
        yield scrapy.Request(url="http://example.com/redirect302.php",
                             dont_filter=True, callback=self.parse_p)

    def parse_p(self, response):
        print(response.request.meta.get('redirect_urls'))
        print("done!")

Example output...

DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
DEBUG: Redirecting (302) to <GET http://myredirect.com> from <GET http://example.com/redirect302.php>
DEBUG: Crawled (200) <GET http://myredirect.com/> (referer: http://example.com/redirect302.php)
['http://example.com/redirect302.php']
done!

If you really want to scrape the 302 pages, you have to explicitly allow it. For example, here I allow 302 and set dont_redirect to True:

handle_httpstatus_list = [302]  # let 302 responses reach the callback

def parse(self, response):
    r = scrapy.Request(url="http://example.com/redirect302.php",
                       dont_filter=True, callback=self.parse_p)
    r.meta['dont_redirect'] = True
    yield r

The end result is:

DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
DEBUG: Crawled (302) <GET http://example.com/redirect302.php> (referer: http://www.example.com/)
None
done!

This spider manually follows 302 URLs:

import scrapy

class DimSpider(scrapy.Spider):
    name = "dim"

    handle_httpstatus_list = [302]

    def start_requests(self):
        yield scrapy.Request("http://page_with_or_without_redirect.html",
                             callback=self.parse200_or_302, meta={'dont_redirect':True})

    def parse200_or_302(self, response):
        print("I'm on: %s with status %d" % (response.url, response.status))
        if 'location' in response.headers:
            print("redirecting")
            # header values are bytes in Scrapy, so decode before reusing the URL
            return [scrapy.Request(response.headers['Location'].decode(),
                                   callback=self.parse200_or_302,
                                   meta={'dont_redirect': True})]

Be careful: don't omit setting handle_httpstatus_list = [302], otherwise you will get "HTTP status code is not handled or not allowed".
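
As an aside, Scrapy also honours a handle_httpstatus_list key in the request meta, so if you only need this for specific requests, a per-request sketch (same placeholder URL as above) could look like:

yield scrapy.Request("http://page_with_or_without_redirect.html",
                     callback=self.parse200_or_302,
                     meta={'dont_redirect': True,
                           # allow the 302 through for this request only
                           'handle_httpstatus_list': [302]})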

answered Sep 07 '25 by neverlastn