I'm writing a small script in Python 2.7.3 with GRequests and lxml that will allow me to gather some collectible card prices from various websites and compare them. Problem is one of the websites limits the number of requests and sends back HTTP error 429 if I exceed it.
Is there a way to throttle the number of requests in GRequests so that I don't exceed the number of requests per second I specify? Also, how can I make GRequests retry after some time if HTTP 429 occurs?
On a side note - their limit is ridiculously low. Something like 8 requests per 15 seconds. I breached it with my browser on multiple occasions just refreshing the page waiting for price changes.
Rate Limiting and Throttling policies are designed to limit API access, but have different intentions: Rate limiting protects an API by applying a hard limit on its access. Throttling shapes API access by smoothing spikes in traffic.
A rate limit is the number of API calls an app or user can make within a given time period. If this limit is exceeded or if CPU or total time limits are exceeded, the app or user may be throttled. API requests made by a throttled user or app will fail. All API requests are subject to rate limits.
How does rate limiting work? Rate limiting runs within an application, rather than running on the web server itself. Typically, rate limiting is based on tracking the IP addresses that requests are coming from, and tracking how much time elapses between each request.
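To make the description above concrete, here is a minimal sketch of a per-IP limiter that remembers when each client last made requests and refuses new ones once a window is full. The class name `WindowRateLimiter`, its parameters, and the injectable `clock` are hypothetical choices for illustration; real servers may count differently (fixed windows, token buckets, and so on).

```python
import collections
import time


class WindowRateLimiter(object):
    """Allow at most max_hits requests per client within `window` seconds."""

    def __init__(self, max_hits, window, clock=time.time):
        self.max_hits = max_hits
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = collections.defaultdict(collections.deque)

    def allow(self, client_ip):
        """Return True and record the request if it is within the limit."""
        now = self.clock()
        stamps = self.hits[client_ip]
        # Forget requests that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.max_hits:
            return False  # a server would typically answer with HTTP 429 here
        stamps.append(now)
        return True
```

With `WindowRateLimiter(8, 15)`, roughly the limit described in the question, the ninth request from the same address inside 15 seconds is refused, while other addresses are unaffected.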
Add a wait() command inside your workers to make them wait between requests (in the example from the documentation: inside the "while true" loop, after the task_done call). Example: 5 worker threads with a waiting time of 1 second between requests will do fewer than 5 fetches per second.
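A rough sketch of that worker pattern, assuming the standard library queue and threads rather than GRequests. The names `run_workers`, `fetch`, and `pause` are made up for the example, and `time.sleep` stands in for the wait. Error handling is left out: if `fetch` raises, that worker thread simply dies.

```python
import threading
import time

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2


def run_workers(tasks, fetch, num_workers=5, pause=1.0):
    """Process tasks with num_workers threads, pausing after each task."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            try:
                result = fetch(item)
                with lock:
                    results.append(result)
            finally:
                q.task_done()
            # Wait before taking the next task, so each worker performs
            # at most one fetch per `pause` seconds.
            time.sleep(pause)

    for _ in range(num_workers):
        thread = threading.Thread(target=worker)
        thread.daemon = True  # do not keep the process alive for idle workers
        thread.start()

    for task in tasks:
        q.put(task)
    q.join()  # blocks until every task has been marked done
    return results
```

Note that this only bounds the average rate: all workers can still fire at the same instant, so a strict "N per window" limit may need a shared throttle instead.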
Going to answer my own question since I had to figure this out by myself and there seems to be very little info on this going around.
The idea is as follows. Every request object used with GRequests can take a session object as a parameter when created. Session objects, on the other hand, can have HTTP adapters mounted that are used when making requests. By creating our own adapter we can intercept requests and rate-limit them in the way we find best for our application. In my case I ended up with the code below.
Object used for throttling:
```python
import datetime

DEFAULT_BURST_WINDOW = datetime.timedelta(seconds=5)
DEFAULT_WAIT_WINDOW = datetime.timedelta(seconds=15)


class BurstThrottle(object):
    max_hits = None
    hits = None
    burst_window = None
    total_window = None
    timestamp = None

    def __init__(self, max_hits, burst_window, wait_window):
        self.max_hits = max_hits
        self.hits = 0
        self.burst_window = burst_window
        self.total_window = burst_window + wait_window
        self.timestamp = datetime.datetime.min

    def throttle(self):
        """Return how long the caller must wait (zero means 'go ahead')."""
        now = datetime.datetime.utcnow()
        if now < self.timestamp + self.total_window:
            # Still inside the current burst + wait cycle.
            if (now < self.timestamp + self.burst_window) and (self.hits < self.max_hits):
                self.hits += 1
                return datetime.timedelta(0)
            else:
                # Burst is used up or over: wait until the cycle ends.
                return self.timestamp + self.total_window - now
        else:
            # The previous cycle has ended: start a new one with this hit.
            self.timestamp = now
            self.hits = 1
            return datetime.timedelta(0)
```
HTTP adapter:
```python
import datetime

import gevent
import requests.adapters

# BurstThrottle, DEFAULT_BURST_WINDOW and DEFAULT_WAIT_WINDOW come from the block above.


class MyHttpAdapter(requests.adapters.HTTPAdapter):
    throttle = None

    def __init__(self, pool_connections=requests.adapters.DEFAULT_POOLSIZE,
                 pool_maxsize=requests.adapters.DEFAULT_POOLSIZE,
                 max_retries=requests.adapters.DEFAULT_RETRIES,
                 pool_block=requests.adapters.DEFAULT_POOLBLOCK,
                 burst_window=DEFAULT_BURST_WINDOW,
                 wait_window=DEFAULT_WAIT_WINDOW):
        # pool_maxsize doubles as the number of requests allowed per burst.
        self.throttle = BurstThrottle(pool_maxsize, burst_window, wait_window)
        super(MyHttpAdapter, self).__init__(pool_connections=pool_connections,
                                            pool_maxsize=pool_maxsize,
                                            max_retries=max_retries,
                                            pool_block=pool_block)

    def send(self, request, stream=False, timeout=None, verify=True, cert=None, proxies=None):
        request_successful = False
        response = None
        while not request_successful:
            # Sleep (cooperatively) until the throttle lets this request through.
            wait_time = self.throttle.throttle()
            while wait_time > datetime.timedelta(0):
                gevent.sleep(wait_time.total_seconds(), ref=True)
                wait_time = self.throttle.throttle()

            response = super(MyHttpAdapter, self).send(request, stream=stream,
                                                       timeout=timeout, verify=verify,
                                                       cert=cert, proxies=proxies)

            # On HTTP 429, go around again: throttle, then resend.
            if response.status_code != 429:
                request_successful = True

        return response
```
Setup:
```python
import datetime

import grequests
import requests

# `adapter` is the module holding MyHttpAdapter; __CONCURRENT_LIMIT__, urls and
# handle_response are defined elsewhere in the script.
requests_adapter = adapter.MyHttpAdapter(
    pool_connections=__CONCURRENT_LIMIT__,
    pool_maxsize=__CONCURRENT_LIMIT__,
    max_retries=0,
    pool_block=False,
    burst_window=datetime.timedelta(seconds=5),
    wait_window=datetime.timedelta(seconds=20))

requests_session = requests.session()
requests_session.mount('http://', requests_adapter)
requests_session.mount('https://', requests_adapter)

unsent_requests = (grequests.get(url,
                                 hooks={'response': handle_response},
                                 session=requests_session) for url in urls)
grequests.map(unsent_requests, size=__CONCURRENT_LIMIT__)
```
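One possible refinement for the "retry after some time" part: servers may send a `Retry-After` header along with a 429, giving the delay in seconds. The sketch below is not part of the code above; `retry_delay`, `base`, and `cap` are hypothetical names, and whether this particular site sends the header at all is an assumption. When the header is missing or not a plain number, it falls back to capped exponential backoff.

```python
def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Return the number of seconds to wait before retrying after HTTP 429.

    headers: mapping of response headers (e.g. response.headers in requests).
    attempt: zero-based count of retries already made for this request.
    """
    value = headers.get('Retry-After')
    if value is not None:
        try:
            return max(0.0, float(value))
        except ValueError:
            # Retry-After may also be an HTTP date; this sketch ignores
            # that form and falls through to the backoff below.
            pass
    # Exponential backoff: base, 2*base, 4*base, ... limited to `cap`.
    return min(cap, base * 2 ** attempt)
```

Inside a `send` loop like the one above, this could be used as `gevent.sleep(retry_delay(response.headers, attempt))` before resending, with `attempt` counted per request.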
Take a look at this for automatic requests throttling: https://pypi.python.org/pypi/RequestsThrottler/0.2.2
You can either set a fixed delay between requests or set a number of requests to send in a fixed number of seconds (which is basically the same thing):
```python
import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
with BaseThrottler(name='base-throttler', delay=1.5) as bt:
    throttled_requests = bt.multi_submit(reqs)
```
where the function multi_submit returns a list of ThrottledRequest (see doc: link at the end).
You can then access the responses:
```python
for tr in throttled_requests:
    print tr.response
```
Alternatively, you can achieve the same by specifying the number of requests to send in a fixed amount of time (e.g. 15 requests every 60 seconds):
```python
import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
with BaseThrottler(name='base-throttler', reqs_over_time=(15, 60)) as bt:
    throttled_requests = bt.multi_submit(reqs)
```
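To see why the two settings are "basically the same thing": spreading N requests evenly over T seconds means one request every T / N seconds. The helper below is only that arithmetic; `delay_for` is a made-up name, and I have not checked how RequestsThrottler converts `reqs_over_time` internally.

```python
def delay_for(request_count, seconds):
    """Delay between requests that spreads request_count evenly over seconds."""
    return float(seconds) / request_count


delay_15_per_60 = delay_for(15, 60)  # the (15, 60) example above
delay_8_per_15 = delay_for(8, 15)    # the limit mentioned in the question
```

So 15 requests per 60 seconds corresponds to a 4 second delay, and the 8 requests per 15 seconds from the question to a delay of 1.875 seconds. Evenly spaced requests never burst, which is stricter than the limit requires but safe.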
Both solutions can be implemented without the usage of the with statement:
```python
import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
bt = BaseThrottler(name='base-throttler', delay=1.5)
bt.start()
throttled_requests = bt.multi_submit(reqs)
bt.shutdown()
```
For more details: http://pythonhosted.org/RequestsThrottler/index.html