How to crawl with breadth-first search using Scrapy (Python3)?

I want to run the crawler as a breadth-first search, so I wrote the following code.

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com/a']

    def parse(self, response):
        # Follow the link to the next a-page.
        next_a = response.css('.next::attr(href)').extract_first()
        if next_a:
            yield scrapy.Request(response.urljoin(next_a),
                                 callback=self.parse, priority=3)

        # Follow every b-page linked from this a-page.
        for b in response.css('.b::attr(href)').extract():
            yield scrapy.Request(response.urljoin(b),
                                 callback=self.parse_b, priority=2)

    def parse_b(self, response):
        pass

I expected the crawler to visit the pages in this order:
a1, a2, a3, ..., an, b1, b2, b3, ..., bn

But it actually runs like this:
a1, b1, b2, ..., b_n1, a2, b_(n1+1), b_(n1+2), ...

How can I make it crawl breadth-first, as expected?

asked Mar 31 '17 by asari72


1 Answer

Quoting from the Scrapy FAQ:

Does Scrapy crawl in breadth-first or depth-first order?

By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
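
These settings can go in the project's settings.py, or directly on the spider through its custom_settings attribute. Below is a minimal sketch applying them to the spider from the question; the spider name is a placeholder I added, and the explicit priority= arguments are dropped because DEPTH_PRIORITY already adjusts priorities by depth.

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder name, not part of the original question
    start_urls = ['http://example.com/a']

    # Breadth-first order: deprioritize deeper requests (DEPTH_PRIORITY)
    # and use FIFO queues so the scheduler pops the oldest request first.
    custom_settings = {
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }

    def parse(self, response):
        next_a = response.css('.next::attr(href)').extract_first()
        if next_a:
            yield scrapy.Request(response.urljoin(next_a), callback=self.parse)

        for b in response.css('.b::attr(href)').extract():
            yield scrapy.Request(response.urljoin(b), callback=self.parse_b)

    def parse_b(self, response):
        pass

With DEPTH_PRIORITY = 1, Scrapy's DepthMiddleware lowers the priority of each request by its depth, so manually setting priority= as in the question should no longer be necessary and may work against the depth-based ordering.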
answered Oct 23 '22 by Umair Ayub