Scrapy doesn't seem to be doing DFO

I have a website for which my crawler needs to follow a sequence. For example, it needs to crawl a1, b1, c1 before it starts on a2, and so on. Each of a, b and c is handled by a different parse function, and the corresponding URLs are created in a Request object and yielded. The following roughly illustrates the code I'm using:

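from scrapy.http import Request
from scrapy.spider import BaseSpider

class aspider(BaseSpider):

    def parse(self, response):
        # b stands for the URL of the matching level-b page
        yield Request(b, callback=self.parse_b, priority=10)

    def parse_b(self, response):
        # c stands for the URL of the matching level-c page
        yield Request(c, callback=self.parse_c, priority=20)

    def parse_c(self, response):
        final_function()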

However, I find that the sequence of crawls seems to be a1, a2, a3, b1, b2, b3, c1, c2, c3, which is strange since I thought Scrapy was supposed to guarantee depth-first order.

The sequence doesn't have to be strict, but the site I'm scraping has a limit in place, so Scrapy needs to start scraping level c before 5 level-b pages have been crawled. How can this be achieved?

Mishari asked Mar 03 '12 17:03


2 Answers

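The order you want (a1, b1, c1, a2, etc.) is depth-first order:

go as deep as possible from a1 before moving on to a2

The order you are seeing (a1, a2, a3, b1, etc.) is breadth-first. The crawl order is controlled by the settings below. With the values shown here, as given in the Scrapy FAQ, Scrapy crawls in breadth-first order: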

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

Source: the FAQ on doc.scrapy.org

Bryan Wolfford answered Nov 15 '22 13:11


I believe you are noticing the difference between the depth-first and breadth-first search algorithms (see Wikipedia for information on both).

Scrapy lets you change which order is used. From the FAQ:

"By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:"

The settings themselves are listed in the FAQ: http://doc.scrapy.org/en/0.14/faq.html
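
To see why the queue type decides the crawl order, here is a minimal pure-Python sketch (it is not Scrapy code). It assumes a made-up link graph in which each a page links to one b page and each b page to one c page, and a single worker that handles one pending request at a time. `LINKS` and `crawl_order` are names invented for this illustration.

```python
from collections import deque

# Hypothetical link graph: a1 -> b1 -> c1, a2 -> b2 -> c2, a3 -> b3 -> c3.
LINKS = {f"a{i}": [f"b{i}"] for i in (1, 2, 3)}
LINKS.update({f"b{i}": [f"c{i}"] for i in (1, 2, 3)})


def crawl_order(start_urls, lifo):
    """Return the order in which a single worker would visit the pages."""
    pending = deque(start_urls)
    order = []
    while pending:
        # LIFO takes the newest pending request, FIFO takes the oldest.
        url = pending.pop() if lifo else pending.popleft()
        order.append(url)
        # "Parsing" a page schedules the pages it links to.
        pending.extend(LINKS.get(url, []))
    return order


starts = ["a1", "a2", "a3"]
dfo = crawl_order(starts, lifo=True)   # depth-first: each chain is finished first
bfo = crawl_order(starts, lifo=False)  # breadth-first: level by level

print(dfo)  # ['a3', 'b3', 'c3', 'a2', 'b2', 'c2', 'a1', 'b1', 'c1']
print(bfo)  # ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3']
```

With the LIFO queue the crawl starts from the last start URL, because that request was added most recently. A real Scrapy crawl also downloads several pages concurrently and takes request priorities into account, so the order you observe is usually less tidy than in this sketch.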

Peter Kirby answered Nov 15 '22 14:11