
If I get a 500 internal server error in Scrapy, how do I skip the URL?

I'm scraping data off several thousand pages with the general URL of:

http://example.com/database/?id=(some number)

where I am running through the id numbers.

I keep encountering huge chunks of URLs that generate a 500 internal server error, and Scrapy goes over these chunks several times for some reason. This eats up a lot of time, so I am wondering if there is a way to move on to the next URL immediately instead of having Scrapy resend the request several times.

asked May 22 '14 by galilei

1 Answer

The component that retries 500 errors is RetryMiddleware.

If you do not want Scrapy to retry requests that received a 500 status code, set RETRY_HTTP_CODES in your settings.py so that it no longer includes 500 (the default is [500, 502, 503, 504, 400, 408]), or disable RetryMiddleware altogether with RETRY_ENABLED = False.
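For example, in settings.py this could look as follows. This is a minimal sketch: the list below is just the default quoted above with 500 removed, and the exact default may differ across Scrapy versions.

    # settings.py

    # Drop 500 from the retry list so Scrapy moves on immediately,
    # while still retrying the other transient error codes.
    RETRY_HTTP_CODES = [502, 503, 504, 400, 408]

    # Alternatively, disable RetryMiddleware entirely:
    # RETRY_ENABLED = False

With 500 removed from the list, a 500 response is passed through (and dropped by default) rather than re-queued, so the spider proceeds to the next id without burning time on retries.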

See the RetryMiddleware settings in the Scrapy documentation for more.

answered Oct 06 '22 by paul trmbrth