I want to run my crawler in breadth-first order, so I wrote the following code:
from scrapy import Spider
from scrapy.http import Request

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://example.com/a']

    def parse(self, response):
        # Follow the link to the next a-page, at higher priority.
        next_a = response.css('.next::attr(href)').extract_first()
        if next_a:
            yield Request(response.urljoin(next_a), callback=self.parse, priority=3)
        # Queue every b-page linked from this a-page.
        for b in response.css('.b::attr(href)').extract():
            yield Request(response.urljoin(b), callback=self.parse_b, priority=2)

    def parse_b(self, response):
        pass
I expect the crawler to visit the pages in this order:

a1, a2, a3, ..., an, b1, b2, b3, ..., bn

But it actually crawls like this:

a1, b1, b2, ..., b_n1, a2, b_n1+1, b_n1+2, ...

How can I make it crawl in the expected order?
Quoting from the Scrapy FAQ:
Does Scrapy crawl in breadth-first or depth-first order?
By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
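Here is a minimal sketch of how this could look applied to your spider. It assumes you want the settings to affect only this spider, so they go into the spider's custom_settings attribute (they could equally go in the project's settings.py), and the explicit per-request priorities from the original code are dropped so that the depth-based priority alone drives the ordering. The spider name is a placeholder.

# Sketch: breadth-first crawling via per-spider settings overrides.
from scrapy import Spider
from scrapy.http import Request

class MySpider(Spider):
    name = 'myspider'  # placeholder name
    start_urls = ['http://example.com/a']

    # Per-spider overrides; these could also live in settings.py.
    custom_settings = {
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }

    def parse(self, response):
        # No manual priority: the depth middleware now orders requests.
        next_a = response.css('.next::attr(href)').extract_first()
        if next_a:
            yield Request(response.urljoin(next_a), callback=self.parse)
        for b in response.css('.b::attr(href)').extract():
            yield Request(response.urljoin(b), callback=self.parse_b)

    def parse_b(self, response):
        pass

With DEPTH_PRIORITY = 1, the depth middleware lowers the priority of deeper requests, and the FIFO queues dequeue same-priority requests in insertion order, which together yields breadth-first crawling. Keeping the manual priority=3/priority=2 arguments would be combined with that depth-based adjustment, so it is simpler to drop them.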