
Limiting/throttling the rate of HTTP requests in GRequests

I'm writing a small script in Python 2.7.3 with GRequests and lxml that will allow me to gather some collectible card prices from various websites and compare them. Problem is one of the websites limits the number of requests and sends back HTTP error 429 if I exceed it.

Is there a way to throttle the number of requests in GRequests so that I don't exceed the number of requests per second I specify? Also, how can I make GRequests retry after some time if HTTP 429 occurs?

On a side note, their limit is ridiculously low: something like 8 requests per 15 seconds. I breached it with my browser on multiple occasions just by refreshing the page while waiting for price changes.

asked Nov 27 '13 by Bartłomiej Siwek

People also ask

Is throttling rate limiting?

Rate Limiting and Throttling policies are designed to limit API access, but have different intentions: Rate limiting protects an API by applying a hard limit on its access. Throttling shapes API access by smoothing spikes in traffic.

What is an API rate limiter?

A rate limit is the number of API calls an app or user can make within a given time period. If this limit is exceeded or if CPU or total time limits are exceeded, the app or user may be throttled. API requests made by a throttled user or app will fail. All API requests are subject to rate limits.

How does a rate limiter work?

How does rate limiting work? Rate limiting runs within an application, rather than running on the web server itself. Typically, rate limiting is based on tracking the IP addresses that requests are coming from, and tracking how much time elapses between each request.
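
To make that concrete, here is a minimal sketch of such a limiter (not from any particular library; the class, method and limit values below are made up for illustration), tracking recent request timestamps per client IP:

# Hypothetical per-client rate limiter: at most max_hits requests per
# `window` seconds for each client IP.
import time
from collections import defaultdict, deque


class SimpleRateLimiter(object):
    def __init__(self, max_hits, window):
        self.max_hits = max_hits
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip):
        now = time.time()
        timestamps = self.hits[ip]
        # Forget requests that have fallen outside the window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        if len(timestamps) < self.max_hits:
            timestamps.append(now)
            return True
        return False

For example, SimpleRateLimiter(8, 15).allow('1.2.3.4') would correspond to the 8-requests-per-15-seconds limit mentioned in the question.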

How do you limit requests per second in Python?

Add a wait() call inside your workers so that they wait between requests (in the example from the documentation: inside the "while True" loop, after task_done). Example: 5 "Worker" threads with a waiting time of 1 second between requests will do fewer than 5 fetches per second.
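
For reference, a rough sketch of that worker/queue pattern (assuming Python 2's Queue module to match the question's Python 2.7.3; the URLs and counts are placeholders):

import threading
import time
import Queue

import requests

url_queue = Queue.Queue()


def worker():
    while True:
        url = url_queue.get()
        try:
            requests.get(url)
        finally:
            url_queue.task_done()
        time.sleep(1)  # wait between requests so each worker does at most ~1 fetch/s


# 5 workers, each waiting 1 second between requests: under 5 fetches per second overall.
for _ in range(5):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

for u in ['http://example.com/page/%d' % i for i in range(20)]:
    url_queue.put(u)

url_queue.join()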


2 Answers

Going to answer my own question since I had to figure this out by myself and there seems to be very little info on this going around.

The idea is as follows. Every request object used with GRequests can take a session object as a parameter when created. Session objects, on the other hand, can have HTTP adapters mounted that are used when making requests. By creating our own adapter we can intercept requests and rate-limit them in whatever way we find best for our application. In my case I ended up with the code below.

Object used for throttling:

import datetime

DEFAULT_BURST_WINDOW = datetime.timedelta(seconds=5)
DEFAULT_WAIT_WINDOW = datetime.timedelta(seconds=15)


class BurstThrottle(object):
    max_hits = None
    hits = None
    burst_window = None
    total_window = None
    timestamp = None

    def __init__(self, max_hits, burst_window, wait_window):
        self.max_hits = max_hits
        self.hits = 0
        self.burst_window = burst_window
        self.total_window = burst_window + wait_window
        self.timestamp = datetime.datetime.min

    def throttle(self):
        now = datetime.datetime.utcnow()
        if now < self.timestamp + self.total_window:
            if (now < self.timestamp + self.burst_window) and (self.hits < self.max_hits):
                # Still inside the burst window and under the hit limit: no wait needed.
                self.hits += 1
                return datetime.timedelta(0)
            else:
                # Burst exhausted: report how long to wait until the window expires.
                return self.timestamp + self.total_window - now
        else:
            # The previous window has expired: start a new one.
            self.timestamp = now
            self.hits = 1
            return datetime.timedelta(0)
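
For intuition, here is a quick illustration (not part of the original answer; the numbers just mirror the 8-requests-per-15-seconds limit from the question) of how the throttle object behaves:

throttle = BurstThrottle(max_hits=8,
                         burst_window=datetime.timedelta(seconds=5),
                         wait_window=datetime.timedelta(seconds=15))

for i in range(10):
    # The first 8 calls inside the burst window return a zero wait;
    # the remaining calls return how long to sleep before trying again.
    print i, throttle.throttle()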

HTTP adapter:

import datetime

import gevent
import requests.adapters


class MyHttpAdapter(requests.adapters.HTTPAdapter):
    throttle = None

    def __init__(self, pool_connections=requests.adapters.DEFAULT_POOLSIZE,
                 pool_maxsize=requests.adapters.DEFAULT_POOLSIZE,
                 max_retries=requests.adapters.DEFAULT_RETRIES,
                 pool_block=requests.adapters.DEFAULT_POOLBLOCK,
                 burst_window=DEFAULT_BURST_WINDOW,
                 wait_window=DEFAULT_WAIT_WINDOW):
        self.throttle = BurstThrottle(pool_maxsize, burst_window, wait_window)
        super(MyHttpAdapter, self).__init__(pool_connections=pool_connections,
                                            pool_maxsize=pool_maxsize,
                                            max_retries=max_retries,
                                            pool_block=pool_block)

    def send(self, request, stream=False, timeout=None, verify=True, cert=None, proxies=None):
        request_successful = False
        response = None
        while not request_successful:
            # Sleep (cooperatively, via gevent) until the throttle allows another hit.
            wait_time = self.throttle.throttle()
            while wait_time > datetime.timedelta(0):
                gevent.sleep(wait_time.total_seconds(), ref=True)
                wait_time = self.throttle.throttle()

            response = super(MyHttpAdapter, self).send(request, stream=stream, timeout=timeout,
                                                       verify=verify, cert=cert, proxies=proxies)

            # If the server still answered 429, go around the loop and retry after waiting.
            if response.status_code != 429:
                request_successful = True

        return response

Setup:

# __CONCURRENT_LIMIT__, handle_response, urls and the adapter module holding
# MyHttpAdapter are defined elsewhere in the original script.
requests_adapter = adapter.MyHttpAdapter(
    pool_connections=__CONCURRENT_LIMIT__,
    pool_maxsize=__CONCURRENT_LIMIT__,
    max_retries=0,
    pool_block=False,
    burst_window=datetime.timedelta(seconds=5),
    wait_window=datetime.timedelta(seconds=20))

requests_session = requests.session()
requests_session.mount('http://', requests_adapter)
requests_session.mount('https://', requests_adapter)

unsent_requests = (grequests.get(url,
                                 hooks={'response': handle_response},
                                 session=requests_session)
                   for url in urls)
grequests.map(unsent_requests, size=__CONCURRENT_LIMIT__)
answered by Bartłomiej Siwek


Take a look at this for automatic request throttling: https://pypi.python.org/pypi/RequestsThrottler/0.2.2

You can either set a fixed delay between each request or set a number of requests to send in a fixed amount of time (which is basically the same thing):

import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests

with BaseThrottler(name='base-throttler', delay=1.5) as bt:
    throttled_requests = bt.multi_submit(reqs)

where the function multi_submit returns a list of ThrottledRequest objects (see the docs: link at the end).

You can then access the responses:

for tr in throttled_requests:
    print tr.response

Alternatively you can achieve the same by specifying the number of requests to send in a fixed amount of time (e.g. 15 requests every 60 seconds):

import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests

with BaseThrottler(name='base-throttler', reqs_over_time=(15, 60)) as bt:
    throttled_requests = bt.multi_submit(reqs)

Both solutions can be implemented without using the with statement:

import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests

bt = BaseThrottler(name='base-throttler', delay=1.5)
bt.start()
throttled_requests = bt.multi_submit(reqs)
bt.shutdown()

For more details: http://pythonhosted.org/RequestsThrottler/index.html

answered by se7entyse7en