Dynamically change scrapy Request scheduler priority

Tags: python, scrapy

I'm using scrapy to run tests against an internal web app. Once all my tests are done, I use a CrawlSpider to check the whole app: for each response I run an HTML validator and look for 404 media files.

It works very well, except for one thing: at the end, the crawl GETs things in a random order, so URLs that perform a DELETE operation end up being executed before other operations.

I would like to schedule all the deletes at the end. I tried many ways, with this kind of scheduler:

from scrapy import log

class DeleteDelayer(object):
    def enqueue_request(self, spider, request):
        if request.url.find('delete') != -1:
            log.msg("delay %s" % request.url, log.DEBUG)
            # bump the priority, hoping the request gets scheduled later
            request.priority = 50

But it does not work: I see the deletes being "delayed" in the log, but they are still executed during the crawl.

I thought of using a middleware that piles up all the delete URLs in memory and puts them back in when the spider_idle signal is fired, but I'm not sure how to do this.
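Something like the sketch below is what I imagine: a spider middleware that keeps the delete requests in a list and hands them back to the engine when spider_idle fires. This is only a rough, untested sketch against a recent Scrapy API; the middleware name is made up, and I am not sure engine.crawl is the right call (in newer Scrapy versions it takes the request only, without the spider argument).

from scrapy import Request, signals
from scrapy.exceptions import DontCloseSpider

class DeferDeletesMiddleware(object):
    """Hold back 'delete' requests and re-inject them once the crawl is idle."""

    def __init__(self, crawler):
        self.crawler = crawler
        self.deferred = []  # 'delete' requests held back until the spider is idle
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_spider_output(self, response, result, spider):
        # Filter 'delete' requests out of the callback output and remember them.
        for request_or_item in result:
            if isinstance(request_or_item, Request) and 'delete' in request_or_item.url:
                self.deferred.append(request_or_item)
            else:
                yield request_or_item

    def spider_idle(self, spider):
        if self.deferred:
            for request in self.deferred:
                # Hand the saved requests back to the engine.
                self.crawler.engine.crawl(request, spider)
            self.deferred = []
            # Keep the spider open so the re-injected requests get processed.
            raise DontCloseSpider

It would be enabled through the SPIDER_MIDDLEWARES setting, but I have not gotten this to work.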

What is the best way to achieve this?

thinker007 asked Nov 03 '22

1 Answer

  1. The default priority for a request is 0 and higher-priority requests are fetched first, so setting the priority to 50 will not delay them.
  2. You can use a middleware to collect the 'delete' requests (insert them into your own queue, e.g. a Redis set) and drop them from the current crawl by raising an IgnoreRequest exception.
  3. Start a second crawl with the requests loaded from the queue built in step 2; see the sketch below.
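A minimal sketch of steps 2 and 3, assuming a downloader middleware and a plain text file standing in for the Redis set; the class names, module path, file path and spider name below are made up for illustration:

from scrapy.exceptions import IgnoreRequest

class CollectDeleteMiddleware(object):
    """Drop 'delete' requests from the first crawl and save their URLs."""

    def process_request(self, request, spider):
        if 'delete' in request.url:
            # Remember the URL for the second crawl, then drop the request.
            with open('delete_urls.txt', 'a') as f:
                f.write(request.url + '\n')
            raise IgnoreRequest('deferred delete: %s' % request.url)
        return None  # let every other request through unchanged

Enable it in the project settings, for example:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CollectDeleteMiddleware': 543,
}

The second crawl can then load the saved URLs in start_requests:

import scrapy

class DeleteSpider(scrapy.Spider):
    name = 'delete_spider'

    def start_requests(self):
        # Replay the URLs collected by the first crawl.
        with open('delete_urls.txt') as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info('delete executed: %s', response.url)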
Hank Yang answered Nov 09 '22