I'm trying to scrape a range of webpages, but I'm getting holes: sometimes the website seems to fail to send the HTML response correctly, which leaves empty lines in my CSV output file. How would one retry the request (and the parse) n times when the xpath selector on the response comes back empty? Note that I don't get any HTTP errors.
You could do this with a custom retry middleware; you just need to override the process_response method of the stock RetryMiddleware:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        # this is your check: the page came back fine (200),
        # but the expected xpath matched nothing, so retry
        if response.status == 200 and not response.xpath(spider.retry_xpath):
            return self._retry(
                request,
                'empty xpath "{}"'.format(spider.retry_xpath),
                spider,
            ) or response
        return response
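Because the dont_retry check is kept at the top, individual requests can still opt out of all of this retrying through their meta. A minimal sketch (the spider name and URL are just placeholders):

import scrapy

class NoRetrySpider(scrapy.Spider):
    name = "noretry_example"  # placeholder name

    def start_requests(self):
        # this response is returned as-is even when the xpath check fails
        yield scrapy.Request(
            'http://example.com/page',  # placeholder URL
            meta={'dont_retry': True},
        )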
Then enable it instead of the default RetryMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}
Now you have a middleware where you can configure the xpath to check for, via the retry_xpath attribute inside your spider:
class MySpider(Spider):
    name = "myspidername"

    retry_xpath = '//h2[@class="tadasdop-cat"]'
    ...
This won't retry on an empty Item field by itself, but you can point retry_xpath at the same path that field is extracted from to make it work.
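For example, if your parse callback fills a field from one specific node, pointing retry_xpath at that same node means any response missing it is retried before it can produce an empty row. A sketch under that assumption (the xpath, field name, and URL here are made up):

from scrapy import Spider

class ProductSpider(Spider):
    name = "products"
    start_urls = ['http://example.com/products']  # placeholder URL
    # retry whenever the node that backs the 'title' field is absent
    retry_xpath = '//div[@class="product"]/h1/text()'

    def parse(self, response):
        # an empty match here would already have triggered a retry
        yield {'title': response.xpath(self.retry_xpath).get()}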
You can set the RETRY_TIMES setting in settings.py to the number of times you want each page retried; it defaults to 2.
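For instance, to allow each failing page up to 5 retry attempts, add this to settings.py:

RETRY_TIMES = 5

You can also override this per request with the max_retry_times meta key.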
See the Scrapy docs on RetryMiddleware for more.