Retrying a Scrapy Request even when receiving a 200 status code

There is a website I'm scraping that sometimes returns a 200 status but has no text in response.body (which raises an AttributeError when I try to parse it with Selector).

Is there a simple way to check to make sure the body includes text, and if not, retry the request until it does? Here is some pseudocode to outline what I'm trying to do.

def check_response(response):
    if response.body != '':
        return response
    else:
        # pseudocode: re-issue a copy of the original request
        return Request(copy_of(response.request),
                       callback=check_response)

Basically, is there a way I can repeat a request with the exact same properties (method, url, payload, cookies, etc.)?

asked Mar 06 '26 by chr1sbest

2 Answers

Follow the EAFP principle:

Easier to ask for forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.
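To make the contrast concrete, here is a minimal, Scrapy-free illustration of EAFP: instead of checking up front whether the body is usable, just use it and catch the failure. The `Response` class and `first_line` function below are made up for the example:

```python
# Minimal EAFP sketch (plain Python, no Scrapy): assume the body is
# usable and recover via an except clause when it isn't.
class Response:
    def __init__(self, body=None):
        self.body = body

def first_line(response):
    try:
        # EAFP: just try to use the body...
        return response.body.splitlines()[0]
    except AttributeError:
        # ...and handle the case where it turns out to be None
        return None

print(first_line(Response(body="hello\nworld")))  # hello
print(first_line(Response()))                     # None
```

The LBYL equivalent would check `if response.body is not None` first; EAFP keeps the happy path flat and handles the rare failure in one place.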

Handle an exception and yield a Request to the current url with dont_filter=True:

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.

from scrapy import Request

def parse(self, response):
    try:
        # parsing logic here
        ...
    except AttributeError:
        # empty body - retry the same URL, bypassing the dupe filter
        yield Request(response.url, callback=self.parse, dont_filter=True)

You can also make a copy of the current request (not tested):

new_request = response.request.copy()
new_request.dont_filter = True
yield new_request

Or, make a new request using replace():

new_request = response.request.replace(dont_filter=True)
yield new_request
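The documentation above warns that dont_filter can cause crawling loops. One way to guard against that is to keep a retry counter in request.meta and give up after a few attempts. Below is a minimal, Scrapy-free sketch of that bookkeeping; `should_retry`, `MAX_EMPTY_RETRIES`, and the `empty_retries` meta key are all made-up names, not Scrapy API:

```python
# Sketch of a retry budget to avoid endless dont_filter loops.
# The caller would copy updated_meta onto the new request, e.g. the
# one produced by response.request.replace(dont_filter=True).
MAX_EMPTY_RETRIES = 3

def should_retry(meta, body):
    """Decide whether an empty-body response should be retried.

    Returns (retry, updated_meta).
    """
    if body:                              # got content: no retry needed
        return False, meta
    attempts = meta.get('empty_retries', 0)
    if attempts >= MAX_EMPTY_RETRIES:     # budget spent: give up
        return False, meta
    return True, dict(meta, empty_retries=attempts + 1)
```

This mirrors what Scrapy's built-in RetryMiddleware does for error status codes, but lets you apply the same cap to 200 responses with empty bodies.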
answered Mar 08 '26 by alecxe


How about calling the actual _retry() method from the retry middleware, so it acts as a normal retry with all of its logic that takes your settings into account?

In settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scraper.middlewares.retry.RetryMiddleware': 550,
}

Then your retry middleware could look like this:

from scrapy.downloadermiddlewares.retry import RetryMiddleware \
    as BaseRetryMiddleware


class RetryMiddleware(BaseRetryMiddleware):

    def process_response(self, request, response, spider):
        # inject a retry method so the spider itself can retry a request
        # on its own conditions, even for 200 responses
        if not hasattr(spider, '_retry'):
            spider._retry = self._retry
        return super(RetryMiddleware, self).process_response(
            request, response, spider)

Then, in your success callback, you can retry the request when your own condition fails, for example:

yield self._retry(response.request, ValueError, self)
answered Mar 08 '26 by Dmitriy


