We have build a free globally distributed mobility analytics REST API. Meaning we have servers all over the world which run different versions (USA, Europe, etc..) of the same application. The services are behind a load balancer so I can't guarantee that the same user always get's the same application/server if he/she does requests today or tomorrow. The API is public but users have to provide an API key in order for us to match them to their paid request quota.
Since we do heavy number crunching with every request, we want to minimize request times as far as possible, inparticular for authentication/authorization and quota monitoring. Since we currently only use one user database (which has to be located in a single data center) there are cases where users in the US make a request to an application/server in the US which authenticates the user in Europe. So we are looking for a solution where the user database interaction:
Things we have done so far:
Since this is my first globally distributed REST API I wonder how other companies do this. (yelp, Google, etc.)
Any feedback is very kindly appreciated,
Cheers,
Daniel
There is no right answer, there are several ways to perform this. I'll describe the one I'm most familiar with (and which probably is one of the simplest ways to do it, although probably not the most robust one).
First of all separate authentication and authorization. An user authenticating across the Atlantic is fine, every user request that requires authorization to go across the Atlantic is not. Main user credentials (e.g. a password hash) are a centralized resource, there is no way around that. But you do not need the main credentials every time a request needs to be authorized.
This is how a company I worked at did it (although this was postgres
/python
/django
, nothing to do with java
):
memcached
but it does not cache user authentication (memcached
is very similar to redis
, which you mention).This may produce a situation in which a user can have an action authorized by more than one session at a given time. The algorithm is that: if at least one user session has not expired, the user is authorized. Since the cache is local, there is no need for a costly operation of fetching data across the Atlantic for every request.
Sessions expiring after exactly a day may be annoying for the user, since he can never be sure that his session will not expire in the next couple of seconds. Each request the user makes that is authorized by a session shall extend the session lifetime to a full day again.
This is easy to implement: You need to keep in the session the timestamp
of the last request made. And count the session lifetime based on that timestamp
. If more than a single session may authorize a request the algorithm shall select the younger session (with the longest lifetime remaining) to update.
This is as far as the live system I worked with went. The remainder of the answer is about how I would extend such a system to make space for logging the requests and verifying request quota.
Lets start by making a couple of design assumptions and argue why they are good assumptions. You're now running a distributed system, since not all data on the system is in a single place. A distributed system shall give priority to response speed and be as horizontal (not hierarchical) as possible, to achieve this consistency is sacrificed.
The session mechanism above already sacrifices some consistency. For example, if a user logs in and talks continuously to server A
for a day and half and then a load balancer points the user to server B
, the user may be surprised that he needs to login again. That is an unlikely situation: in most cases a load balancer would divide the user's requests equally over both server A
and server B
over the course of that day and half. And both, server A
and server B
will have live sessions at that point.
In your question you wonder how google deals with request counting. And I'll argue here that it deals with it in an inconsistent way. By that, I mean the fact that you cannot enforce a strict upper limit of requests if you're sacrificing consistency. Companies like google or yelp simply say:
"You have 12000 requests per month but if you do more than that you will pay $0.05 for every 10 requests above that".
That allows for an easier design: you can count requests at any time after they happened, the counting does not need to happen in real time.
One last assumption: a distributed system will have problems with duplicated internal data. This happens because parts of the system are running in real time and parts are doing batch processing without stopping or timestamping the real time system. That happens because you cannot be 100% sure about the state of the real time system at any given point. It is mandatory that every request coming from a customer have a unique identifier of some sort, it can be as simple as customer number + sequence number, but it needs to exist in every request. You may also add such a unique identifier the moment you receive the request.
Now we extend our user session that is cached on every server (often cached in a different state on different servers that are unaware of each other). For every customer request we store the request's unique identifier as part of the session cache. That's right, the request counter in the central database is not updated in real time.
At a point in time (end of day processing, for example) each server performs a batch processing of request identifiers:
This produces a race condition, where a live session may receive requests whilst the server is talking to the central database. For that reason we do not purge request identifiers from live sessions. This, in turn, causes a problem where a request identifier may be logged to the central database twice. And it is for that reason that we need the identifiers to be unique, the central database shall ignore request updates with identifiers already logged.
Invoicing the user does not change, you just query the central database and see how many requests a customer performed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With