I'm scraping data off several thousand pages with the general URL of:
http://example.com/database/?id=(some number)
where I am running through the id numbers.
I keep encountering huge chunks of URLs that generate a 500 internal server error, and Scrapy goes over these chunks several times for some reason. This eats up a lot of time, so I am wondering if there is a way to just move on to the next URL immediately and not have Scrapy send the same request several times.
The component that retries 500 errors is the RetryMiddleware.
If you do not want Scrapy to retry requests that received a 500 status code, you can set RETRY_HTTP_CODES in your settings.py so that it does not include 500 (the default is [500, 502, 503, 504, 400, 408]), or disable the middleware altogether with RETRY_ENABLED = False.
See the RetryMiddleware settings in the Scrapy documentation for more.
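As a minimal sketch, the relevant lines in settings.py could look like this (only the retry-related settings are shown; the code list simply drops 500 from the defaults quoted above):

    # settings.py -- only the retry-related settings are shown

    # Option 1: keep retrying other transient errors, but stop retrying 500s
    # by removing 500 from the list of retryable status codes.
    RETRY_HTTP_CODES = [502, 503, 504, 400, 408]

    # Option 2: disable the RetryMiddleware entirely so no request is retried.
    # RETRY_ENABLED = False

With either option, a URL that returns a 500 will be handled once and the crawl moves straight on to the next request instead of re-queuing it.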