How can I make scrapy crawl break and exit when encountering the first exception?


For development purposes, I would like to stop all Scrapy crawling activity as soon as the first exception (in a spider or a pipeline) occurs.

Any advice?

Udi asked Mar 01 '12 22:03


People also ask

How do you exit a Scrapy shell?

Finally, you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawl.
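
For context, that shell is usually opened from inside a running spider with inspect_response. A minimal sketch, assuming a throwaway spider and a placeholder URL (both invented for illustration):

import scrapy
from scrapy.shell import inspect_response

class DebugSpider(scrapy.Spider):
    name = 'debug'
    start_urls = ['http://example.com/']  # placeholder URL

    def parse(self, response):
        # Drop into the interactive shell to inspect this response;
        # pressing Ctrl-D (Ctrl-Z on Windows) exits the shell and the crawl resumes.
        inspect_response(response, self)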

What does Scrapy crawl do?

Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them.
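
To make that concrete, here is a rough sketch of the same extraction done once with Scrapy's built-in selectors and once with BeautifulSoup inside a spider callback (the spider name, URL, and CSS selector are made up for the example):

import scrapy
from bs4 import BeautifulSoup

class TitlesSpider(scrapy.Spider):
    name = 'titles'
    start_urls = ['http://example.com/']  # placeholder URL

    def parse(self, response):
        # Built-in Scrapy selectors
        titles = response.css('h1.title::text').extract()

        # The same extraction with BeautifulSoup, if you prefer it
        soup = BeautifulSoup(response.text, 'html.parser')
        titles_bs = [h1.get_text() for h1 in soup.select('h1.title')]

        yield {'titles': titles, 'titles_bs': titles_bs}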

What is Start_urls in Scrapy?

start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for it. See http://doc.scrapy.org/en/latest/topics/spiders.html for examples.
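
A minimal CrawlSpider sketch along those lines (the domain, start URL, and link patterns are placeholders, not something from the original answer):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RecursiveSpider(CrawlSpider):
    name = 'recursive'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']  # the crawl begins here

    rules = (
        # Follow category pages recursively without a callback
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        # Send item pages to parse_item
        Rule(LinkExtractor(allow=r'/item/'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url}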

What is a spider in Scrapy?

Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
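
As a sketch of what such a class looks like in practice (modelled loosely on the Scrapy tutorial; the spider name and selectors are illustrative):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # How to extract structured data (scraping items)
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}

        # How to perform the crawl (follow links)
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)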


1 Answer

In a spider, you can just raise a CloseSpider exception.

from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

For other components (middlewares, pipelines, etc.), you can manually call close_spider, as akhter mentioned.
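
A hedged sketch of how that might look in an item pipeline, assuming access to the crawler through from_crawler (the pipeline name, helper method, and reason string are invented for the example; close_spider schedules a shutdown, so in-flight requests still finish):

class StopOnErrorPipeline:
    """Illustrative pipeline that closes the spider on the first failure."""

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        try:
            return self._do_work(item)
        except Exception:
            # Ask the engine to stop the crawl, then re-raise so the
            # exception is still logged.
            self.crawler.engine.close_spider(spider, 'first_exception')
            raise

    def _do_work(self, item):
        # placeholder for whatever processing might fail
        return item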

imwilsonxu answered Sep 28 '22 04:09