
How to retry the request n times when an item gets an empty field?

Tags:

scrapy

I'm trying to scrape a range of web pages, but I get holes: sometimes the website seems to fail to send the HTML response correctly, which leaves empty lines in the CSV output file. How can I retry the request and the parse n times when the XPath selector on the response comes back empty? Note that I don't get any HTTP errors.

ChiseledAbs asked Dec 31 '16 00:12


2 Answers

You could do this with a custom retry middleware; you just need to override the process_response method of the built-in RetryMiddleware:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

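        # This is your check: retry a 200 response when the spider's
        # retry_xpath matches nothing, i.e. the expected content is missing.
        retry_xpath = getattr(spider, 'retry_xpath', None)
        if response.status == 200 and retry_xpath and not response.xpath(retry_xpath):
            reason = 'no match for xpath "{}"'.format(retry_xpath)
            return self._retry(request, reason, spider) or response
        return response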

Then enable it instead of the default RetryMiddleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}

Now you have a middleware whose retry check you can configure per spider through the retry_xpath attribute:

from scrapy import Spider


class MySpider(Spider):
    name = "myspidername"

    retry_xpath = '//h2[@class="tadasdop-cat"]'
    ...

This checks the response rather than the Item's fields, so it won't necessarily retry whenever a field is empty, but you can make it work by setting retry_xpath to the same path you use for that field.
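If you would rather keep the check next to the item code, a similar retry can probably be done from the spider callback instead of a middleware. The following is only a minimal sketch: the helper next_retry_meta, the meta key empty_retry_count and the limit of 3 are hypothetical names chosen for illustration, not part of Scrapy. The commented usage assumes Scrapy's Request.replace and the dont_filter argument.

```python
def next_retry_meta(meta, max_retries=3, key="empty_retry_count"):
    """Return a copy of meta with the retry counter bumped, or None if spent.

    `key` and `max_retries` are hypothetical names for this sketch,
    not Scrapy settings.
    """
    attempts = meta.get(key, 0)
    if attempts >= max_retries:
        return None
    updated = dict(meta)  # copy so the original request meta is untouched
    updated[key] = attempts + 1
    return updated


# Possible use inside a spider callback (assumes Scrapy's Request.replace):
#
# def parse(self, response):
#     title = response.xpath(self.retry_xpath).get()
#     if not title:
#         meta = next_retry_meta(response.meta)
#         if meta is not None:
#             # dont_filter=True keeps the duplicate filter from
#             # dropping the repeated request
#             yield response.request.replace(meta=meta, dont_filter=True)
#         return
#     yield {"title": title}
```

The counter lives in the request meta, so each URL gets its own retry budget and the spider stops re-requesting a page that stays empty.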

eLRuLL answered Sep 30 '22 02:09


You can set the RETRY_TIMES setting in settings.py to the number of times you want pages to be retried. It defaults to 2.
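As a small illustration, the setting might look like this; the value 5 is an arbitrary example. Note that RETRY_TIMES only sets how many retries are allowed, so it applies to whatever the retry middleware decides to retry and does not by itself detect an empty selector.

```python
# settings.py
# Maximum number of retries per request, in addition to the first attempt.
RETRY_TIMES = 5

# A single request can override this through the max_retry_times meta key,
# for example: Request(url, meta={"max_retry_times": 10})
```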

See more on RetryMiddleware

Granitosaurus answered Sep 30 '22 03:09