I'm using Scrapy to perform tests on an internal web app. Once all my tests are done, I use CrawlSpider to check every page, and for each response I run an HTML validator and look for 404 media files.
It works very well except for this: at the end, the crawl GETs
things in a random order...
So, URLs that perform DELETE operations are being executed before other operations.
I would like to schedule all the deletes at the end. I tried many ways, with this kind of scheduler:
from scrapy import log

class DeleteDelayer(object):
    def enqueue_request(self, spider, request):
        if request.url.find('delete') != -1:
            log.msg("delay %s" % request.url, log.DEBUG)
            request.priority = 50
But it does not work... I see the deletes being "delayed" in the log, but they are still executed during the crawl.
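One thing I noticed in the docs is that Scrapy fetches requests with a higher priority value earlier, so 50 would actually push the deletes towards the front; to push them to the back the priority would have to be negative, and it has to be set before the request is enqueued. Something like this is what I mean (a sketch only, assuming CrawlSpider's Rule process_request hook, whose signature differs between Scrapy versions):

def deprioritize_deletes(request, response=None):
    # requests with a lower (negative) priority are dequeued later,
    # so delete URLs should end up at the back of the queue
    if 'delete' in request.url:
        return request.replace(priority=-100)
    return request

# hypothetical usage in the spider's rules:
# Rule(LinkExtractor(), callback='check_page',
#      process_request=deprioritize_deletes, follow=True)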
I thought of using a middleware that piles up all the delete URLs in memory and, when the spider_idle
signal fires, puts them back in, but I'm not sure how to do this.
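Here is roughly what I have in mind, though I don't know if it is the right approach (a sketch only; it assumes a spider middleware, and the engine's crawl() call signature has changed between Scrapy versions):

from scrapy import signals
from scrapy.http import Request
from scrapy.exceptions import DontCloseSpider

class DeleteDelayerMiddleware(object):
    # Spider middleware sketch: keep "delete" requests out of the
    # scheduler and only release them once the spider goes idle.

    def __init__(self, crawler):
        self.crawler = crawler
        self.delayed = []
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_spider_output(self, response, result, spider):
        for item_or_request in result:
            if isinstance(item_or_request, Request) and 'delete' in item_or_request.url:
                # park delete requests instead of letting them through
                self.delayed.append(item_or_request)
            else:
                yield item_or_request

    def spider_idle(self, spider):
        # nothing left to crawl: feed the parked delete requests back in
        if self.delayed:
            for request in self.delayed:
                self.crawler.engine.crawl(request, spider)
            self.delayed = []
            raise DontCloseSpider

I suppose it would then be enabled in settings.py with something like SPIDER_MIDDLEWARES = {'myproject.middlewares.DeleteDelayerMiddleware': 543} (the module path and order value are just placeholders).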
What is the best way to achieve this?