 

Efficiently using a rate-limited API (Echo Nest) with distributed clients

Background

Echo Nest has a rate-limited API. A given application (identified in requests by an API key) can make up to 120 REST calls a minute. The service response includes an estimate of the total number of calls made in the last minute; repeated abuse of the API (exceeding the limit) may cause the API key to be revoked.

When used from a single machine (a web server providing a service to clients) it is easy to control access - the server has full knowledge of the history of requests and can regulate itself correctly.

But I am working on a program where distributed, independent clients make requests in parallel.

In such a case it is much less clear what an optimal solution would be. And in general the problem appears unsolvable - if more than 120 clients, all with no previous history, make an initial request at the same time, then the rate will be exceeded.

But since this is a personal project, client use is expected to be sporadic (bursty), and my projects have never been hugely successful, that is not expected to be a huge problem. A more likely problem is that there are times when a smaller number of clients want to make many requests as quickly as possible. For example, a client may, exceptionally, need to make several thousand requests when starting for the first time - and two clients might start at around the same time, so they must cooperate to share the available bandwidth.

Given all the above, what are suitable algorithms for the clients so that they rate-limit appropriately? Note that limited cooperation is possible because the API returns the total number of requests in the last minute for all clients.

Current Solution

My current solution (when the question was written - a better approach is given as an answer) is quite simple. Each client has a record of the time the last call was made and the number of calls made in the last minute, as reported by the API, on that call.

If the number of calls is less than 60 (half the limit) the client does not throttle. This allows for fast bursts of small numbers of requests.

Otherwise (ie when there are more previous requests) the client calculates the period it would need to work at to stay within the limit - period = 60 / (120 - number of previous requests) seconds - and then waits until the gap between the previous call and the current time exceeds that period. This effectively throttles the rate so that, if the client were acting alone, it would not exceed the limit.
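For concreteness, a minimal Python sketch of that heuristic might look like the following (the class and method names are mine, not from my actual code, and the max(1, ...) guard against division by zero at the limit is an addition):

    import time

    class SimpleThrottle:

        LIMIT = 120             # maximum calls per minute allowed by the API
        THRESHOLD = LIMIT // 2  # below this reported count, do not throttle

        def __init__(self):
            self.last_call = 0.0   # time of the previous call
            self.last_count = 0    # calls in the last minute, as then reported

        def wait(self):
            """Block until this heuristic says the next call is safe."""
            if self.last_count >= self.THRESHOLD:
                # spread the remaining budget over the next minute
                period = 60.0 / max(1, self.LIMIT - self.last_count)
                gap = time.time() - self.last_call
                if gap < period:
                    time.sleep(period - gap)

        def record(self, count):
            """Call after each request, with the count reported by the API."""
            self.last_call = time.time()
            self.last_count = count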

But the above has problems. If you think it through carefully you'll see that for large numbers of requests a single client oscillates and does not reach maximum throughput (this is partly because of the "initial burst" which will suddenly "fall outside the window" and partly because the algorithm does not make full use of its history). And multiple clients will cooperate to an extent, but I doubt that it is optimal.

Better Solutions

I can imagine a better solution that uses the full local history of the client and models other clients with, say, a Hidden Markov Model. So each client would use the API report to model the other (unknown) clients and adjust its rate accordingly.

I can also imagine an algorithm for a single client that progressively transitions from unlimited behaviour for small bursts to optimal, limited behaviour for many requests without introducing oscillations.

Do such approaches exist? Can anyone provide an implementation or reference? Can anyone think of better heuristics?

I imagine this is a known problem somewhere. In what field? Queuing theory?

I also guess (see comments earlier) that there is no optimal solution and that there may be some lore / tradition / accepted heuristic that works well in practice. I would love to know what... At the moment I am struggling to identify a similar problem in known network protocols (I imagine Perlman would have some beautiful solution if so).

I am also interested (to a lesser degree, for future reference if the program becomes popular) in a solution that requires a central server to aid collaboration.

Disclaimer

This question is not intended to be criticism of Echo Nest at all; their service and conditions of use are great. But the more I think about how best to use this, the more complex/interesting it becomes...

Also, each client has a local cache used to avoid repeating calls.

Updates

Possibly relevant paper.

asked Aug 28 '12 by andrew cooke


2 Answers

The approach described in the question worked, but was very noisy, and the code was a mess. I am now using a simpler approach:

  • Make a call
  • From the response, note the limit and count
  • Calculate

    barrier = now() + 60 / max(1, (limit - count))**greedy
    
  • On the next call, wait until barrier

The idea is quite simple: you should wait a length of time proportional to how few requests are left in that minute. For example, if count is 39 and limit is 40 then you wait an entire minute. But if count is zero then you can make a request soon. The greedy parameter is a trade-off - when greater than 1 the "first" calls are made more quickly, but you are more likely to hit the limit and end up waiting for 60s.
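A minimal sketch of that heuristic in Python, for reference (the names are illustrative; this is not the exact code linked below). Call wait() before each request and record() with the limit and count from the response afterwards:

    import time

    class BarrierThrottle:

        def __init__(self, greedy=1.0):
            self.greedy = greedy
            self.barrier = 0.0   # earliest time the next call may be made

        def wait(self):
            """Call before each request; sleeps until the barrier has passed."""
            delay = self.barrier - time.time()
            if delay > 0:
                time.sleep(delay)

        def record(self, limit, count):
            """Call after each request, with limit and count from the response."""
            # the fewer requests remain in this minute, the longer the wait;
            # greedy > 1 shortens the early waits at the risk of a 60s stall
            self.barrier = time.time() + 60.0 / max(1, limit - count) ** self.greedy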

The performance of this is similar to the approach above, and it's much more robust. It is particularly good when clients are "bursty" as the approach above gets confused trying to estimate linear rates, while this will happily let a client "steal" a few rapid requests when demand is low.

Code here.

answered Nov 05 '22 by andrew cooke


After some experimenting, it seems that the most important thing is getting as good an estimate as possible for the upper limit of the current connection rates.

Each client can track their own (local) connection rate using a queue of timestamps. A timestamp is added to the queue on each connection and timestamps older than a minute are discarded. The "long term" (over a minute) average rate is then found from the first and last timestamps and the number of entries (minus one). The "short term" (instantaneous) rate can be found from the times of the last two requests. The upper limit is the maximum of these two values.
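As a sketch (in Python; the names and the small epsilon guards against zero time differences are mine):

    import time
    from collections import deque

    class LocalRate:

        def __init__(self, window=60.0):
            self.window = window
            self.times = deque()   # timestamps of local calls within the window

        def record(self, now=None):
            """Add a timestamp on each connection; discard entries over a minute old."""
            now = time.time() if now is None else now
            self.times.append(now)
            while now - self.times[0] > self.window:
                self.times.popleft()

        def upper_limit(self):
            """Maximum of the long-term and instantaneous rates, in calls/second."""
            if len(self.times) < 2:
                return 0.0
            long_term = (len(self.times) - 1) / max(self.times[-1] - self.times[0], 1e-9)
            short_term = 1.0 / max(self.times[-1] - self.times[-2], 1e-9)
            return max(long_term, short_term)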

Each client can also estimate the external connection rate (from the other clients). The "long term" rate can be found from the number of "used" connections in the last minute, as reported by the server, corrected by the number of local connections (from the queue mentioned above). The "short term" rate can be estimated from the "used" number since the previous request (minus one, for the local connection), scaled by the time difference. Again, the upper limit (maximum of these two values) is used.
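The external estimate might be sketched like this (again, names and structure are mine; used is the last-minute count reported by the server and local_count the current size of the local queue above):

    class ExternalRate:

        def __init__(self, window=60.0):
            self.window = window
            self.prev_used = None
            self.prev_time = None

        def upper_limit(self, used, local_count, now):
            """Maximum of the long-term and short-term external rates, in calls/second."""
            # long term: the server's last-minute count, minus our own share
            long_term = max(0, used - local_count) / self.window
            short_term = 0.0
            if self.prev_used is not None:
                dt = max(now - self.prev_time, 1e-9)
                # growth in "used" since our previous request, minus one for it
                short_term = max(0, used - self.prev_used - 1) / dt
            self.prev_used, self.prev_time = used, now
            return max(long_term, short_term)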

Each client computes these two rates (local and external) and then adds them to estimate the upper limit to the total rate of connections to the server. This value is compared with the target rate band, which is currently set to between 80% and 90% of the maximum (0.8 to 0.9 * 120 per minute).

From the difference between the estimated and target rates, each client modifies its own connection rate. This is done by taking the previous delta (the time between the last connection and the one before) and scaling it by 1.1 (if the rate exceeds the target) or 0.9 (if the rate is lower than the target). The client then refuses to make a new connection until that scaled delta has passed (by sleeping if a new connection is requested).

Finally, nothing above forces all clients to equally share the bandwidth. So I add an additional 10% to the local rate estimate. This has the effect of preferentially over-estimating the rate for clients that have high rates, which makes them more likely to reduce their rate. In this way the "greedy" clients have a slightly stronger pressure to reduce consumption which, over the long term, appears to be sufficient to keep the distribution of resources balanced.
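Tying the pieces together, the whole loop might look like the sketch below. The constants follow the text (80-90% target band, 1.1/0.9 gains, +10% local bias), but the structure is a reconstruction rather than the original code:

    import time

    class RateController:

        MAX_RATE = 120 / 60.0                        # server limit, calls/second
        LOW, HIGH = 0.8 * MAX_RATE, 0.9 * MAX_RATE   # target band

        def __init__(self, local, external):
            self.local = local        # LocalRate sketch above
            self.external = external  # ExternalRate sketch above
            self.delta = 0.0          # enforced gap between connections

        def update(self, used):
            """Call just after each response, with the server's "used" count."""
            now = time.time()
            self.local.record(now)
            # the extra 10% on the local estimate penalises greedy clients
            total = 1.1 * self.local.upper_limit() + \
                    self.external.upper_limit(used, len(self.local.times), now)
            ts = self.local.times
            self.delta = ts[-1] - ts[-2] if len(ts) > 1 else 0.0
            if total > self.HIGH:
                self.delta *= 1.1     # above the band: wait longer next time
            elif total < self.LOW:
                self.delta *= 0.9     # below the band: wait less

        def wait(self):
            """Call before each connection; sleeps until the delta has passed."""
            if self.local.times:
                pause = self.local.times[-1] + self.delta - time.time()
                if pause > 0:
                    time.sleep(pause)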

The important insights are:

  • By taking the maximum of "long term" and "short term" estimates the system is conservative (and more stable) when additional clients start up.

  • No client knows the total number of clients (unless it is zero or one), but all clients run the same code so can "trust" each other.

  • Given the above, you can't make "exact" calculations about what rate to use, but you can make a "constant" correction (in this case, +/- 10% factor) depending on the global rate.

  • The adjustment to the client connection frequency is made to the delta between the last two connections (adjusting based on the average over the whole minute is too slow and leads to oscillations).

  • Balanced consumption can be achieved by penalising the greedy clients slightly.

In (limited) experiments this works fairly well (even in the worst case of multiple clients starting at once). The main drawbacks are: (1) it doesn't allow for an initial "burst" (which would improve throughput if the server has few clients and a client has only a few requests); (2) the system does still oscillate over ~ a minute (see below); (3) handling a larger number of clients (in the worst case, eg if they all start at once) requires a larger gain (eg 20% correction instead of 10%) which tends to make the system less stable.

[Plot: the "used" amount reported by the (test) server, plotted against time (Unix epoch), for four clients (coloured), all trying to consume as much data as possible.]

The oscillations come from the usual source - corrections lag the signal. They are damped by (1) using the upper limit of the rates (predicting the long-term rate from the instantaneous value) and (2) using a target band. This is why an answer informed by someone who understands control theory would be appreciated...

It's not clear to me that estimating local and external rates separately is important (it may help when the short-term rate for one is high while the long-term rate for the other is high), but I doubt removing it would improve things.

In conclusion: this is all pretty much as I expected, for this kind of approach. It kind-of works, but because it's a simple feedback-based approach it's only stable within a limited range of parameters. I don't know what alternatives might be possible.

answered Nov 05 '22 by 15 revs