I have a website for which my crawler needs to follow a sequence: for example, it needs to visit a1, b1, c1 before it starts on a2, and so on. Each of a, b, and c is handled by a different parse function, and the corresponding URLs are created in Request objects and yielded. The following roughly illustrates the code I'm using:
from scrapy.http import Request
from scrapy.spider import BaseSpider

class aspider(BaseSpider):

    def parse(self, response):
        # level a: queue the matching level-b URL at a higher priority
        yield Request(b, callback=self.parse_b, priority=10)

    def parse_b(self, response):
        # level b: queue the matching level-c URL at a higher priority still
        yield Request(c, callback=self.parse_c, priority=20)

    def parse_c(self, response):
        final_function()
However, I find that the sequence of crawls seems to be a1, a2, a3, b1, b2, b3, c1, c2, c3, which is strange, since I thought Scrapy was supposed to guarantee depth-first order.
The sequence doesn't have to be strict, but the site I'm scraping has a limit in place, so Scrapy needs to start scraping level c as soon as it can, before five of the level-b pages have been crawled. How can this be achieved?
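To illustrate why I expected the priorities to help: a plain priority queue (a toy model, not Scrapy's actual scheduler; the a1/b1/c1 URLs are the hypothetical ones from above) would prefer the deeper requests as soon as they are yielded:

```python
import heapq

# Toy model of a priority scheduler, not Scrapy itself. Scrapy-style
# priorities mean "higher number = crawl sooner", so we negate them for
# heapq, which pops the smallest sort key first.
class Scheduler:
    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker that preserves insertion order

    def push(self, url, priority=0):
        heapq.heappush(self._heap, (-priority, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

sched = Scheduler()
# All three start URLs are queued before any response arrives...
for url in ("a1", "a2", "a3"):
    sched.push(url, priority=0)
# ...but once a1's response yields b1 (priority=10), b1 jumps the queue:
order = [sched.pop()]          # a1 is fetched first
sched.push("b1", priority=10)
order.append(sched.pop())      # b1 preempts a2 and a3
sched.push("c1", priority=20)
order.append(sched.pop())      # c1 preempts everything else
print(order)
```

In this model the crawl proceeds a1, b1, c1 as intended; in the real spider, concurrency means several a-requests are already in flight before the first b is ever yielded.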
Depth-first searching is exactly what you are describing:

search as deep into the a's as possible before moving to the b's

To change Scrapy to do breadth-first searching (a1, b1, c1, a2, etc.), change these settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
(Found in the doc.scrapy.org FAQ.)
I believe that you are noticing the difference between the depth-first and breadth-first search algorithms (see Wikipedia for information on both).
Scrapy has the ability to change which algorithm is used:
"By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:"
See http://doc.scrapy.org/en/0.14/faq.html for more information.
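To see the difference concretely, here is a small self-contained simulation (a toy frontier, not Scrapy itself), assuming the question's structure where each a-page links to one b-page and each b-page to one c-page. The only change between the two runs is LIFO vs. FIFO dequeuing:

```python
from collections import deque

# Toy link structure: "a1" -> "b1" -> "c1", "a2" -> "b2" -> "c2", etc.
def children(url):
    level, n = url[0], url[1:]
    return [chr(ord(level) + 1) + n] if level in "ab" else []

def crawl(start, lifo):
    """Crawl the toy site, dequeuing LIFO (stack, ~DFO) or FIFO (queue, ~BFO)."""
    frontier = deque(start)
    order = []
    while frontier:
        url = frontier.pop() if lifo else frontier.popleft()
        order.append(url)
        frontier.extend(children(url))
    return order

# LIFO follows each chain to the bottom before starting the next one
# (note it begins with the most recently queued start URL):
print(crawl(["a1", "a2"], lifo=True))
# FIFO finishes a whole level before descending:
print(crawl(["a1", "a2"], lifo=False))
```

The FIFO run reproduces exactly the a1, a2, b1, b2, c1, c2 order the question observes, while the LIFO run yields complete a-b-c chains one at a time.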