How to handle a 429 Too Many Requests response in Scrapy?

I'm trying to run a scraper whose output log ends as follows:

2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.apkmirror.com/apk/instagram/instagram-instagram/instagram-instagram-9-0-0-34920-release/instagram-9-0-0-4-android-apk-download/>: HTTP status code is not handled or not allowed
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-25 20:22:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16048410,
 'downloader/request_count': 32902,
 'downloader/request_method_count/GET': 32902,
 'downloader/response_bytes': 117633316,
 'downloader/response_count': 32902,
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/429': 32781,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 25, 18, 22, 22, 710446),
 'log_count/DEBUG': 32903,
 'log_count/INFO': 32815,
 'request_depth_max': 2,
 'response_received_count': 32902,
 'scheduler/dequeued': 32902,
 'scheduler/dequeued/memory': 32902,
 'scheduler/enqueued': 32902,
 'scheduler/enqueued/memory': 32902,
 'start_time': datetime.datetime(2017, 4, 25, 17, 54, 36, 621481)}
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Spider closed (finished)

In short, of the 32,902 requests only 121 were successful (response code 200), while the rest received a 429 'Too Many Requests' (cf. https://httpstatuses.com/429).

Are there any recommended ways to get around this? To start with, I'd like to have a look at the details of the 429 response rather than just ignoring it, as it may contain a Retry-After header indicating how long to wait before making a new request.
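For reference, one way to let the 429 responses reach the spider for inspection, instead of having the HttpError middleware drop them, is Scrapy's handle_httpstatus_list attribute. A minimal sketch (the spider name is made up for illustration):

import scrapy


class ApkMirrorSpider(scrapy.Spider):
    name = 'apkmirror'  # hypothetical name
    start_urls = ['http://www.apkmirror.com/']
    # Let 429 responses reach the callback instead of being dropped
    # by the HttpError spider middleware.
    handle_httpstatus_list = [429]

    def parse(self, response):
        if response.status == 429:
            retry_after = response.headers.get('Retry-After')  # bytes or None
            self.logger.info('Rate limited; Retry-After: %r', retry_after)
            return
        # ... normal parsing of 200 responses would go here ...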

Also, if the requests are made using Privoxy and Tor as described in http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/, it may be possible to implement retry middleware which makes Tor change its IP address when this occurs. Are there any public examples of such code?
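For what it's worth, a sketch of what such middleware might look like, assuming Tor's ControlPort is enabled on 9051 and the stem library is installed (the class name is made up for illustration):

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from stem import Signal
from stem.control import Controller


class TorRotateRetryMiddleware(RetryMiddleware):
    def _new_tor_identity(self):
        # Ask Tor for a fresh circuit (and usually a new exit IP).
        # Note that Tor rate-limits NEWNYM signals to roughly one
        # every ten seconds.
        with Controller.from_port(port=9051) as controller:
            controller.authenticate()  # assumes cookie auth or no password
            controller.signal(Signal.NEWNYM)

    def process_response(self, request, response, spider):
        if response.status == 429:
            self._new_tor_identity()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super().process_response(request, response, spider)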

asked Apr 26 '17 by Kurt Peek

2 Answers

You can modify the retry middleware to pause when it gets a 429 error. Put the code below in your project's middlewares.py:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

import time

class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """Retry middleware that pauses the whole crawl when it hits a 429."""

    def __init__(self, crawler):
        super(TooManyRequestsRetryMiddleware, self).__init__(crawler.settings)
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        elif response.status == 429:
            # Pause the engine so no further requests go out while we wait.
            self.crawler.engine.pause()
            time.sleep(60)  # If the rate limit resets after a minute, use 60 seconds, and so on.
            self.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        elif response.status in self.retry_http_codes:
            # Fall back to the standard retry behaviour for other codes.
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response

Add 429 to the retry codes in settings.py. Note that this setting replaces Scrapy's default list, so include defaults such as 500, 502, 503, 504 and 408 if you still want those retried:

RETRY_HTTP_CODES = [429]

Then activate it in settings.py. Don't forget to deactivate the default retry middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,  # disable the default
    'flat.middlewares.TooManyRequestsRetryMiddleware': 543,  # 'flat' is this project's name; use your own module path
}
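As a possible refinement (not part of the original answer), the fixed time.sleep(60) in the 429 branch could honor the server's Retry-After header when one is present. A naive sketch that only handles the delay-in-seconds form of the header:

# Hypothetical variant of the 429 branch above:
retry_after = response.headers.get('Retry-After')  # bytes or None
try:
    delay = int(retry_after)  # int() accepts bytes like b'120' on Python 3
except (TypeError, ValueError):
    delay = 60  # fall back to a fixed pause
self.crawler.engine.pause()
time.sleep(delay)
self.crawler.engine.unpause()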
answered Sep 23 '22 by Aminah Nuraini


Wow, your scraper is going really fast: almost 33,000 requests in under half an hour. That's close to 20 requests per second.

Such a high volume will trigger rate limiting on bigger sites and will completely bring down smaller sites. Don't do that.

This might even be too fast for Privoxy and Tor, so they may also be the source of some of those 429 replies.

Solutions:

  1. Start slow. Reduce the concurrency settings and increase DOWNLOAD_DELAY so you make at most 1 request per second. Then increase these values step by step and see what happens. It might sound paradoxical, but you might get more items and more 200 responses by going slower (see the settings sketch after this list).

  2. If you are scraping a big site, try rotating proxies. The Tor network might be a bit heavy-handed for this in my experience, so you might try a proxy service, as Umair suggests.
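For point 1, a conservative starting point in settings.py might look like this (the exact numbers are illustrative, not prescribed; Scrapy's AutoThrottle extension can then adapt the delay to the server's response times):

CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1  # seconds between requests, i.e. at most ~1 request per second
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60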

answered Sep 25 '22 by Done Data Solutions