 

Strategy to keep local cache see the same "version" of data in a distributed system

I'm trying to build a distributed system to run some performance intensive calculations. One calculation can be done in parallel at multiple worker nodes. The problem is, as the data source keeps changing in real-time, we want each worker node (during a single calculation) to operate on the same "version" of data, i.e. a point-in-time snapshot of the database. This is to avoid inconsistent results.

Another issue is, the entire set of input data per calculation can be very large, so currently we keep a local cache at each worker node, which refreshes the content periodically by asking the data source for "diffs" since the current local cache version and applies the diffs to the local cache.

What are some design strategies to achieve the requirement that each worker node sees the same "version" of data (while still having reasonably fresh data)? I have thought about the solution below, but wanted to see if this is a common pattern that has already been solved:

  • Build a "versioning" service that periodically queries the data source for diffs and stores each diff as a data "version". The worker nodes' caches sync with the versioning service and keep their cached data at multiple versions. For one calculation, we make sure that the worker nodes use input data at the same version to achieve consistency. This versioning service should also keep the latest copy of the entire data set, so that a worker node can load its cache initially and restore the local cache content after going down and coming back up.
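The versioned-cache part of that idea can be sketched in a few lines. Everything here (the class and method names, the diff format) is hypothetical; a diff is assumed to be a map from key to new value, with `None` marking a deletion:

```python
# Hypothetical sketch of a worker-side cache that keeps several recent
# versions, so all workers in one calculation can pin the same version id.
import threading


class VersionedCache:
    def __init__(self, initial_data, max_versions=5):
        self._lock = threading.Lock()
        self._versions = {0: dict(initial_data)}  # version id -> full snapshot
        self.latest = 0
        self._max_versions = max_versions

    def apply_diff(self, diff):
        """Materialize the next version as (latest snapshot + diff)."""
        with self._lock:
            snapshot = dict(self._versions[self.latest])
            for key, value in diff.items():
                if value is None:
                    snapshot.pop(key, None)  # None in a diff means "deleted"
                else:
                    snapshot[key] = value
            self.latest += 1
            self._versions[self.latest] = snapshot
            # Evict versions too old for any running calculation to be pinned to.
            while len(self._versions) > self._max_versions:
                del self._versions[min(self._versions)]
            return self.latest

    def get(self, version, key):
        """Read a key as of a pinned version, for one consistent calculation."""
        return self._versions[version].get(key)


cache = VersionedCache({"a": 1, "b": 2})
v1 = cache.apply_diff({"a": 10, "b": None})
assert cache.get(0, "a") == 1      # the old snapshot is untouched
assert cache.get(v1, "a") == 10    # the new version reflects the diff
assert cache.get(v1, "b") is None  # "b" was deleted in the diff
```

Note that storing a full snapshot per version would be far too expensive at your ~100GB scale; a real implementation would keep one base snapshot plus the diff chain and resolve reads per version, but the pinning logic stays the same.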

Some estimated parameters of the system:

  • Number of workers: 10

  • Average job duration: obviously we want this to be as fast as possible, but let's say it should be less than 2 minutes

  • Input data for a job (overall for all workers): ~100GB

  • Size of the database: ~1TB

Pinch asked Jul 29 '16 19:07

1 Answer

If you are not tied to MySQL and could use Oracle, there is a simple solution for you:

Oracle Flashback

(I have not found a MySQL flashback equivalent yet; please comment if you know of an engine that supports this.) You do not have to create manual snapshots: with a single database server, all of your processes can read the data as it was represented at the required time. This solution is pretty clean and robust, but it requires licences.

If I were you, I would take a step back and try to simplify the problem a bit more. If the different workers can run in parallel, the following should apply:

  • None of the workers uses the output of the others
  • None of them alters the original data

If both of those requirements hold, you could use a single database to store the calculations. The only thing you have to take care of is that the transactions are carefully planned.

On the other hand, on a similar project we used a small trick to achieve this (the same effect as the flashback solution): the insertion time was stored in the database too, and updates were actually inserts with new timestamps. All of the calculations were made on accurate records by adding to the query the condition

give me the last version of this kind of row before x timestamp

With this solution we avoided the licence costs and the snapshot maintenance. The only problem is that, if you do not need the whole history, it will eat your database space pretty fast. To solve this we made a cron job that clears out the unused records based on the timestamp.
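That "last version of each row before timestamp x" query can be written as a grouped self-join. A minimal sketch using SQLite from Python (the table and column names are made up for illustration):

```python
# Timestamp-versioning trick: updates are inserts with newer timestamps,
# and a grouped self-join reads the data "as of" a given point in time.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotes (symbol TEXT, value REAL, inserted_at INTEGER)")
conn.executemany(
    "INSERT INTO quotes VALUES (?, ?, ?)",
    [("ABC", 1.0, 100), ("XYZ", 9.0, 150), ("ABC", 1.5, 200), ("ABC", 2.0, 300)],
)

AS_OF = """
    SELECT q.symbol, q.value
    FROM quotes q
    JOIN (SELECT symbol, MAX(inserted_at) AS ts
          FROM quotes
          WHERE inserted_at <= ?
          GROUP BY symbol) latest
      ON q.symbol = latest.symbol AND q.inserted_at = latest.ts
"""

# Every worker queries with the same timestamp, so all see the same snapshot.
snapshot = dict(conn.execute(AS_OF, (250,)).fetchall())
assert snapshot == {"ABC": 1.5, "XYZ": 9.0}  # the row at t=300 is invisible

# The cleanup cron job boils down to something like:
conn.execute("DELETE FROM quotes WHERE inserted_at < ?", (150,))
```

The snapshot timestamp becomes the "version" the question asks about: the coordinator picks one timestamp per calculation and hands it to every worker.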

If you want to go further, there is a technique called shadow tables. There is a nice MySQL blog post on this topic: http://arnab.org/blog/shadow-tables-using-mysql-triggers

Hash answered Nov 15 '22 10:11