Retrying a Scrapy Request even when receiving a 200 status code

There is a website I'm scraping that sometimes returns a 200 status but has no text in response.body (which raises an AttributeError when I try to parse it with Selector).

Is there a simple way to check to make sure the body includes text, and if not, retry the request until it does? Here is some pseudocode to outline what I'm trying to do.

def check_response(response):
    if response.body != '':
        return response
    else:
        # pseudocode: re-issue a copy of the original request
        return Request(copy_of(response.request),
                       callback=check_response)

Basically, is there a way I can repeat a request with the exact same properties (method, url, payload, cookies, etc.)?

asked Mar 06 '26 by chr1sbest

2 Answers

Follow the EAFP principle:

Easier to ask for forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.
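To make the contrast concrete, here is a minimal, Scrapy-free illustration of EAFP: instead of checking up front whether the body is usable, just use it and catch the failure. The `Response` class and `first_line` function below are made up for the example:

```python
# Minimal EAFP sketch (plain Python, no Scrapy): assume the body is
# usable and recover via an except clause when it isn't.
class Response:
    def __init__(self, body=None):
        self.body = body

def first_line(response):
    try:
        # EAFP: just try to use the body...
        return response.body.splitlines()[0]
    except AttributeError:
        # ...and handle the case where it turns out to be None
        return None

print(first_line(Response(body="hello\nworld")))  # hello
print(first_line(Response()))                     # None
```

The LBYL equivalent would check `if response.body is not None` first; EAFP keeps the happy path flat and handles the rare failure in one place.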

Handle an exception and yield a Request to the current url with dont_filter=True:

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.

from scrapy import Request

def parse(self, response):
    try:
        # parsing logic here
        ...
    except AttributeError:
        # empty body - retry the same URL, bypassing the dupe filter
        yield Request(response.url, callback=self.parse, dont_filter=True)

You can also make a copy of the current request (not tested):

new_request = response.request.copy()
new_request.dont_filter = True
yield new_request

Or, make a new request using replace():

new_request = response.request.replace(dont_filter=True)
yield new_request
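The documentation above warns that dont_filter can cause crawling loops. One way to guard against that is to keep a retry counter in request.meta and give up after a few attempts. Below is a minimal, Scrapy-free sketch of that bookkeeping; `should_retry`, `MAX_EMPTY_RETRIES`, and the `empty_retries` meta key are all made-up names, not Scrapy API:

```python
# Sketch of a retry budget to avoid endless dont_filter loops.
# The caller would copy updated_meta onto the new request, e.g. the
# one produced by response.request.replace(dont_filter=True).
MAX_EMPTY_RETRIES = 3

def should_retry(meta, body):
    """Decide whether an empty-body response should be retried.

    Returns (retry, updated_meta).
    """
    if body:                              # got content: no retry needed
        return False, meta
    attempts = meta.get('empty_retries', 0)
    if attempts >= MAX_EMPTY_RETRIES:     # budget spent: give up
        return False, meta
    return True, dict(meta, empty_retries=attempts + 1)
```

This mirrors what Scrapy's built-in RetryMiddleware does for error status codes, but lets you apply the same cap to 200 responses with empty bodies.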
answered Mar 08 '26 by alecxe


How about calling the actual _retry() method from the retry middleware, so it acts as a normal retry with all of its logic that takes your settings into account?

In settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scraper.middlewares.retry.RetryMiddleware': 550,
}

Then your retry middleware could look like this:

from scrapy.downloadermiddlewares.retry import RetryMiddleware \
    as BaseRetryMiddleware


class RetryMiddleware(BaseRetryMiddleware):

    def process_response(self, request, response, spider):
        # inject a retry method so the spider itself can retry a request
        # on its own conditions, even for 200 responses
        if not hasattr(spider, '_retry'):
            spider._retry = self._retry
        return super(RetryMiddleware, self).process_response(
            request, response, spider)

Then, in your success callback, you can retry the request when your own condition fails, for example:

yield self._retry(response.request, ValueError, self)
answered Mar 08 '26 by Dmitriy


