Dynamically change scrapy Request scheduler priority

Tags: python, scrapy

I'm using scrapy to run tests against an internal web app. Once all my tests are done, I use a CrawlSpider to check the whole app: for each response I run an HTML validator and look for 404 media files.

It works very well, except for one thing: at the end, the crawl GETs things in a random order, so URLs that perform a DELETE operation end up being executed before other operations.

I would like to schedule all the deletes at the end. I tried many ways, with this kind of scheduler:

from scrapy import log

class DeleteDelayer(object):
    def enqueue_request(self, spider, request):
        if request.url.find('delete') != -1:
            log.msg("delay %s" % request.url, log.DEBUG)
            # bump the priority, hoping the request gets scheduled later
            request.priority = 50

But it does not work: I see the deletes being "delayed" in the log, but they are still executed during the crawl.

I thought of using a middleware that piles up all the delete URLs in memory and puts them back in when the spider_idle signal is fired, but I'm not sure how to do this.
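Something like the sketch below is what I imagine: a spider middleware that keeps the delete requests in a list and hands them back to the engine when spider_idle fires. This is only a rough, untested sketch against a recent Scrapy API; the middleware name is made up, and I am not sure engine.crawl is the right call (in newer Scrapy versions it takes the request only, without the spider argument).

from scrapy import Request, signals
from scrapy.exceptions import DontCloseSpider

class DeferDeletesMiddleware(object):
    """Hold back 'delete' requests and re-inject them once the crawl is idle."""

    def __init__(self, crawler):
        self.crawler = crawler
        self.deferred = []  # 'delete' requests held back until the spider is idle
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_spider_output(self, response, result, spider):
        # Filter 'delete' requests out of the callback output and remember them.
        for request_or_item in result:
            if isinstance(request_or_item, Request) and 'delete' in request_or_item.url:
                self.deferred.append(request_or_item)
            else:
                yield request_or_item

    def spider_idle(self, spider):
        if self.deferred:
            for request in self.deferred:
                # Hand the saved requests back to the engine.
                self.crawler.engine.crawl(request, spider)
            self.deferred = []
            # Keep the spider open so the re-injected requests get processed.
            raise DontCloseSpider

It would be enabled through the SPIDER_MIDDLEWARES setting, but I have not gotten this to work.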

What is the best way to achieve this?

thinker007 asked Nov 03 '22

1 Answer

  1. The default priority for a request is 0 and higher-priority requests are fetched first, so setting the priority to 50 will not delay them.
  2. You can use a middleware to collect the 'delete' requests (insert them into your own queue, e.g. a Redis set) and drop them from the current crawl by raising an IgnoreRequest exception.
  3. Start a second crawl with the requests loaded from the queue built in step 2; see the sketch below.
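A minimal sketch of steps 2 and 3, assuming a downloader middleware and a plain text file standing in for the Redis set; the class names, module path, file path and spider name below are made up for illustration:

from scrapy.exceptions import IgnoreRequest

class CollectDeleteMiddleware(object):
    """Drop 'delete' requests from the first crawl and save their URLs."""

    def process_request(self, request, spider):
        if 'delete' in request.url:
            # Remember the URL for the second crawl, then drop the request.
            with open('delete_urls.txt', 'a') as f:
                f.write(request.url + '\n')
            raise IgnoreRequest('deferred delete: %s' % request.url)
        return None  # let every other request through unchanged

Enable it in the project settings, for example:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CollectDeleteMiddleware': 543,
}

The second crawl can then load the saved URLs in start_requests:

import scrapy

class DeleteSpider(scrapy.Spider):
    name = 'delete_spider'

    def start_requests(self):
        # Replay the URLs collected by the first crawl.
        with open('delete_urls.txt') as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info('delete executed: %s', response.url)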
Hank Yang answered Nov 09 '22