How to handle a 429 Too Many Requests response in Scrapy?

I'm trying to run a scraper whose output log ends as follows:

2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.apkmirror.com/apk/instagram/instagram-instagram/instagram-instagram-9-0-0-34920-release/instagram-9-0-0-4-android-apk-download/>: HTTP status code is not handled or not allowed
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-25 20:22:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16048410,
 'downloader/request_count': 32902,
 'downloader/request_method_count/GET': 32902,
 'downloader/response_bytes': 117633316,
 'downloader/response_count': 32902,
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/429': 32781,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 25, 18, 22, 22, 710446),
 'log_count/DEBUG': 32903,
 'log_count/INFO': 32815,
 'request_depth_max': 2,
 'response_received_count': 32902,
 'scheduler/dequeued': 32902,
 'scheduler/dequeued/memory': 32902,
 'scheduler/enqueued': 32902,
 'scheduler/enqueued/memory': 32902,
 'start_time': datetime.datetime(2017, 4, 25, 17, 54, 36, 621481)}
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Spider closed (finished)

In short, of the 32,902 requests only 121 were successful (response code 200), while the rest received a 429 'Too Many Requests' (cf. https://httpstatuses.com/429).

Are there any recommended ways to get around this? To start with, I'd like to have a look at the details of the 429 response rather than just ignoring it, as it may contain a Retry-After header indicating how long to wait before making a new request.
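For reference, one way to let the 429 responses reach the spider for inspection, instead of having the HttpError middleware drop them, is Scrapy's handle_httpstatus_list attribute. A minimal sketch (the spider name is made up for illustration):

import scrapy


class ApkMirrorSpider(scrapy.Spider):
    name = 'apkmirror'  # hypothetical name
    start_urls = ['http://www.apkmirror.com/']
    # Let 429 responses reach the callback instead of being dropped
    # by the HttpError spider middleware.
    handle_httpstatus_list = [429]

    def parse(self, response):
        if response.status == 429:
            retry_after = response.headers.get('Retry-After')  # bytes or None
            self.logger.info('Rate limited; Retry-After: %r', retry_after)
            return
        # ... normal parsing of 200 responses would go here ...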

Also, if the requests are made using Privoxy and Tor as described in http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/, it may be possible to implement retry middleware which makes Tor change its IP address when this occurs. Are there any public examples of such code?
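For what it's worth, a sketch of what such middleware might look like, assuming Tor's ControlPort is enabled on 9051 and the stem library is installed (the class name is made up for illustration):

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from stem import Signal
from stem.control import Controller


class TorRotateRetryMiddleware(RetryMiddleware):
    def _new_tor_identity(self):
        # Ask Tor for a fresh circuit (and usually a new exit IP).
        # Note that Tor rate-limits NEWNYM signals to roughly one
        # every ten seconds.
        with Controller.from_port(port=9051) as controller:
            controller.authenticate()  # assumes cookie auth or no password
            controller.signal(Signal.NEWNYM)

    def process_response(self, request, response, spider):
        if response.status == 429:
            self._new_tor_identity()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super().process_response(request, response, spider)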

asked Apr 26 '17 by Kurt Peek

2 Answers

You can modify the retry middleware to pause when it gets a 429 error. Put the code below in your project's middlewares.py:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

import time

class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """Retry middleware that pauses the whole crawl when it hits a 429."""

    def __init__(self, crawler):
        super(TooManyRequestsRetryMiddleware, self).__init__(crawler.settings)
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        elif response.status == 429:
            # Pause the engine so no further requests go out while we wait.
            self.crawler.engine.pause()
            time.sleep(60)  # If the rate limit resets after a minute, use 60 seconds, and so on.
            self.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        elif response.status in self.retry_http_codes:
            # Fall back to the standard retry behaviour for other codes.
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response

Add 429 to the retry codes in settings.py. Note that this setting replaces Scrapy's default list, so include defaults such as 500, 502, 503, 504 and 408 if you still want those retried:

RETRY_HTTP_CODES = [429]

Then activate it in settings.py. Don't forget to deactivate the default retry middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,  # disable the default
    'flat.middlewares.TooManyRequestsRetryMiddleware': 543,  # 'flat' is this project's name; use your own module path
}
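As a possible refinement (not part of the original answer), the fixed time.sleep(60) in the 429 branch could honor the server's Retry-After header when one is present. A naive sketch that only handles the delay-in-seconds form of the header:

# Hypothetical variant of the 429 branch above:
retry_after = response.headers.get('Retry-After')  # bytes or None
try:
    delay = int(retry_after)  # int() accepts bytes like b'120' on Python 3
except (TypeError, ValueError):
    delay = 60  # fall back to a fixed pause
self.crawler.engine.pause()
time.sleep(delay)
self.crawler.engine.unpause()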
answered Sep 23 '22 by Aminah Nuraini


Wow, your scraper is going really fast: almost 33,000 requests in under half an hour. That's close to 20 requests per second.

Such a high volume will trigger rate limiting on bigger sites and will completely bring down smaller sites. Don't do that.

This might even be too fast for Privoxy and Tor, so they may also be the source of some of those 429 replies.

Solutions:

  1. Start slow. Reduce the concurrency settings and increase DOWNLOAD_DELAY so you make at most 1 request per second. Then increase these values step by step and see what happens. It might sound paradoxical, but you might get more items and more 200 responses by going slower (see the settings sketch after this list).

  2. If you are scraping a big site, try rotating proxies. The Tor network might be a bit heavy-handed for this in my experience, so you might try a proxy service, as Umair suggests.
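For point 1, a conservative starting point in settings.py might look like this (the exact numbers are illustrative, not prescribed; Scrapy's AutoThrottle extension can then adapt the delay to the server's response times):

CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1  # seconds between requests, i.e. at most ~1 request per second
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60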

answered Sep 25 '22 by Done Data Solutions