I have a website for which my crawler needs to follow a sequence: for example, it needs to visit a1, b1, c1 before it starts on a2, and so on. Each of a, b, and c is handled by a different parse function, and the corresponding URLs are created in Request objects and yielded. The following roughly illustrates the code I'm using:
from scrapy.http import Request
from scrapy.spider import BaseSpider

class aspider(BaseSpider):

    def parse(self, response):
        # level a: queue the matching level-b URL at a higher priority
        yield Request(b, callback=self.parse_b, priority=10)

    def parse_b(self, response):
        # level b: queue the matching level-c URL at a higher priority still
        yield Request(c, callback=self.parse_c, priority=20)

    def parse_c(self, response):
        final_function()
However, I find that the sequence of crawls seems to be a1, a2, a3, b1, b2, b3, c1, c2, c3, which is strange, since I thought Scrapy was supposed to guarantee depth-first order.
The sequence doesn't have to be strict, but the site I'm scraping has a limit in place, so Scrapy needs to start scraping level c as soon as it can, before five of the level-b pages have been crawled. How can this be achieved?
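To illustrate why I expected the priorities to help: a plain priority queue (a toy model, not Scrapy's actual scheduler; the a1/b1/c1 URLs are the hypothetical ones from above) would prefer the deeper requests as soon as they are yielded:

```python
import heapq

# Toy model of a priority scheduler, not Scrapy itself. Scrapy-style
# priorities mean "higher number = crawl sooner", so we negate them for
# heapq, which pops the smallest sort key first.
class Scheduler:
    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker that preserves insertion order

    def push(self, url, priority=0):
        heapq.heappush(self._heap, (-priority, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

sched = Scheduler()
# All three start URLs are queued before any response arrives...
for url in ("a1", "a2", "a3"):
    sched.push(url, priority=0)
# ...but once a1's response yields b1 (priority=10), b1 jumps the queue:
order = [sched.pop()]          # a1 is fetched first
sched.push("b1", priority=10)
order.append(sched.pop())      # b1 preempts a2 and a3
sched.push("c1", priority=20)
order.append(sched.pop())      # c1 preempts everything else
print(order)
```

In this model the crawl proceeds a1, b1, c1 as intended; in the real spider, concurrency means several a-requests are already in flight before the first b is ever yielded.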
Depth-first searching is exactly what you are describing:

search as deep into the a's as possible before moving to the b's

To change Scrapy to do breadth-first searching (a1, b1, c1, a2, etc.), change these settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
(Found in the doc.scrapy.org FAQ.)
I believe that you are noticing the difference between the depth-first and breadth-first search algorithms (see Wikipedia for information on both).
Scrapy has the ability to change which algorithm is used:
"By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:"
See http://doc.scrapy.org/en/0.14/faq.html for more information.
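To see the difference concretely, here is a small self-contained simulation (a toy frontier, not Scrapy itself), assuming the question's structure where each a-page links to one b-page and each b-page to one c-page. The only change between the two runs is LIFO vs. FIFO dequeuing:

```python
from collections import deque

# Toy link structure: "a1" -> "b1" -> "c1", "a2" -> "b2" -> "c2", etc.
def children(url):
    level, n = url[0], url[1:]
    return [chr(ord(level) + 1) + n] if level in "ab" else []

def crawl(start, lifo):
    """Crawl the toy site, dequeuing LIFO (stack, ~DFO) or FIFO (queue, ~BFO)."""
    frontier = deque(start)
    order = []
    while frontier:
        url = frontier.pop() if lifo else frontier.popleft()
        order.append(url)
        frontier.extend(children(url))
    return order

# LIFO follows each chain to the bottom before starting the next one
# (note it begins with the most recently queued start URL):
print(crawl(["a1", "a2"], lifo=True))
# FIFO finishes a whole level before descending:
print(crawl(["a1", "a2"], lifo=False))
```

The FIFO run reproduces exactly the a1, a2, b1, b2, c1, c2 order the question observes, while the LIFO run yields complete a-b-c chains one at a time.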