I want to run my crawler in breadth-first order, so I wrote the following code:
from scrapy import Spider
from scrapy.http import Request

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://example.com/a']

    def parse(self, response):
        # Follow the link to the next a-page, at higher priority.
        next_a = response.css('.next::attr(href)').extract_first()
        if next_a:
            yield Request(response.urljoin(next_a), callback=self.parse, priority=3)
        # Queue every b-page linked from this a-page.
        for b in response.css('.b::attr(href)').extract():
            yield Request(response.urljoin(b), callback=self.parse_b, priority=2)

    def parse_b(self, response):
        pass
I expect the crawler to visit the pages in this order:

a1, a2, a3, ..., an, b1, b2, b3, ..., bn

But it actually crawls like this:

a1, b1, b2, ..., b_n1, a2, b_n1+1, b_n1+2, ...

How can I make it crawl in the expected order?
Quoting from the Scrapy FAQ:
Does Scrapy crawl in breadth-first or depth-first order?
By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
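Here is a minimal sketch of how this could look applied to your spider. It assumes you want the settings to affect only this spider, so they go into the spider's custom_settings attribute (they could equally go in the project's settings.py), and the explicit per-request priorities from the original code are dropped so that the depth-based priority alone drives the ordering. The spider name is a placeholder.

# Sketch: breadth-first crawling via per-spider settings overrides.
from scrapy import Spider
from scrapy.http import Request

class MySpider(Spider):
    name = 'myspider'  # placeholder name
    start_urls = ['http://example.com/a']

    # Per-spider overrides; these could also live in settings.py.
    custom_settings = {
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }

    def parse(self, response):
        # No manual priority: the depth middleware now orders requests.
        next_a = response.css('.next::attr(href)').extract_first()
        if next_a:
            yield Request(response.urljoin(next_a), callback=self.parse)
        for b in response.css('.b::attr(href)').extract():
            yield Request(response.urljoin(b), callback=self.parse_b)

    def parse_b(self, response):
        pass

With DEPTH_PRIORITY = 1, the depth middleware lowers the priority of deeper requests, and the FIFO queues dequeue same-priority requests in insertion order, which together yields breadth-first crawling. Keeping the manual priority=3/priority=2 arguments would be combined with that depth-based adjustment, so it is simpler to drop them.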