Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Architecture for fast globally distributed user quota management

We have build a free globally distributed mobility analytics REST API. Meaning we have servers all over the world which run different versions (USA, Europe, etc..) of the same application. The services are behind a load balancer so I can't guarantee that the same user always get's the same application/server if he/she does requests today or tomorrow. The API is public but users have to provide an API key in order for us to match them to their paid request quota.

Since we do heavy number crunching with every request, we want to minimize request times as far as possible, inparticular for authentication/authorization and quota monitoring. Since we currently only use one user database (which has to be located in a single data center) there are cases where users in the US make a request to an application/server in the US which authenticates the user in Europe. So we are looking for a solution where the user database interaction:

  1. happens on the same application server
  2. get's synchronized between all application servers
  3. should be easily integrable into java application
  4. should be fast (changes happen in every request)

Things we have done so far:

  • a single database on each server > not synchronized, nightmare
  • a single database for all servers > ok, when used with slave as fallback but American users have to authenticate over the Atlantic
  • started installing bdr but failed on the way (no time, too complex, hard to make transition)
  • looked at redis.io

Since this is my first globally distributed REST API I wonder how other companies do this. (yelp, Google, etc.)

Any feedback is very kindly appreciated,

Cheers,

Daniel

like image 661
Daniel Gerber Avatar asked Jun 14 '16 11:06

Daniel Gerber


1 Answers

There is no right answer, there are several ways to perform this. I'll describe the one I'm most familiar with (and which probably is one of the simplest ways to do it, although probably not the most robust one).

1. Separate authentication and authorization

First of all separate authentication and authorization. An user authenticating across the Atlantic is fine, every user request that requires authorization to go across the Atlantic is not. Main user credentials (e.g. a password hash) are a centralized resource, there is no way around that. But you do not need the main credentials every time a request needs to be authorized.

This is how a company I worked at did it (although this was postgres/python/django, nothing to do with java):

  • Each server has a cache of database queries in memcached but it does not cache user authentication (memcached is very similar to redis, which you mention).
  • The authentication is performed in the main data centre, always.
  • A successful authentication produces a user session that expires after a day.
  • The session is cached.

This may produce a situation in which a user can have an action authorized by more than one session at a given time. The algorithm is that: if at least one user session has not expired, the user is authorized. Since the cache is local, there is no need for a costly operation of fetching data across the Atlantic for every request.

2. Session revival

Sessions expiring after exactly a day may be annoying for the user, since he can never be sure that his session will not expire in the next couple of seconds. Each request the user makes that is authorized by a session shall extend the session lifetime to a full day again.

This is easy to implement: You need to keep in the session the timestamp of the last request made. And count the session lifetime based on that timestamp. If more than a single session may authorize a request the algorithm shall select the younger session (with the longest lifetime remaining) to update.

3. Request logging

This is as far as the live system I worked with went. The remainder of the answer is about how I would extend such a system to make space for logging the requests and verifying request quota.

Assumptions

Lets start by making a couple of design assumptions and argue why they are good assumptions. You're now running a distributed system, since not all data on the system is in a single place. A distributed system shall give priority to response speed and be as horizontal (not hierarchical) as possible, to achieve this consistency is sacrificed.

The session mechanism above already sacrifices some consistency. For example, if a user logs in and talks continuously to server A for a day and half and then a load balancer points the user to server B, the user may be surprised that he needs to login again. That is an unlikely situation: in most cases a load balancer would divide the user's requests equally over both server A and server B over the course of that day and half. And both, server A and server B will have live sessions at that point.

In your question you wonder how google deals with request counting. And I'll argue here that it deals with it in an inconsistent way. By that, I mean the fact that you cannot enforce a strict upper limit of requests if you're sacrificing consistency. Companies like google or yelp simply say:

"You have 12000 requests per month but if you do more than that you will pay $0.05 for every 10 requests above that".

That allows for an easier design: you can count requests at any time after they happened, the counting does not need to happen in real time.

One last assumption: a distributed system will have problems with duplicated internal data. This happens because parts of the system are running in real time and parts are doing batch processing without stopping or timestamping the real time system. That happens because you cannot be 100% sure about the state of the real time system at any given point. It is mandatory that every request coming from a customer have a unique identifier of some sort, it can be as simple as customer number + sequence number, but it needs to exist in every request. You may also add such a unique identifier the moment you receive the request.

Design

Now we extend our user session that is cached on every server (often cached in a different state on different servers that are unaware of each other). For every customer request we store the request's unique identifier as part of the session cache. That's right, the request counter in the central database is not updated in real time.

At a point in time (end of day processing, for example) each server performs a batch processing of request identifiers:

  • Live sessions are duplicated, and the copies are expired.
  • All request identifiers in expired sessions are concatenated together, and the central database is written to.
  • All expired sessions are purged.

This produces a race condition, where a live session may receive requests whilst the server is talking to the central database. For that reason we do not purge request identifiers from live sessions. This, in turn, causes a problem where a request identifier may be logged to the central database twice. And it is for that reason that we need the identifiers to be unique, the central database shall ignore request updates with identifiers already logged.

Advantages

  • 99.9% Uptime, the batch processing does not interrupt the real time system.
  • Reduced writes to the database, and reduced communication with the database in general.
  • Easy horizontal growth.

Disadvantages

  • If a server goes down, recovering the requests performed may be tricky.
  • There is no way to stop a customer from doing more requests than he is allowed to (that's a peculiarity of distributed systems).
  • Need to store unique identifiers for all requests, a counter is not enough to measure the number of requests.

Invoicing the user does not change, you just query the central database and see how many requests a customer performed.

like image 88
grochmal Avatar answered Nov 19 '22 11:11

grochmal