I'm trying to build a distributed system to run some performance-intensive calculations. A single calculation can be split across multiple worker nodes. The problem is that the data source keeps changing in real time, and we want every worker node (within a single calculation) to operate on the same "version" of the data, i.e. a point-in-time snapshot of the database, to avoid inconsistent results.
Another issue is that the full input data for one calculation can be very large, so currently each worker node keeps a local cache, which it refreshes periodically by asking the data source for the "diffs" since its current cache version and applying them.
What are some design strategies for ensuring that each worker node sees the same "version" of the data (while still having reasonably fresh data)? I have thought about a solution myself but wanted to check whether this is a common pattern that has already been solved:
Some estimated parameters of the system:
Number of workers: 10
Average job duration: obviously we want this to be as fast as possible, but let's say it should be less than 2 minutes
Input data for a job (overall for all workers): ~100GB
Size of the database: ~1TB
If you are not tied to MySQL and could use Oracle, there is a simple solution for you:
Oracle Flashback
(I have not found a MySQL flashback equivalent yet; please comment if you know of an engine that supports this.) You do not have to create manual snapshots. You can use this with a single database server, and all of your processes can read the data as it was at the required time. This solution is pretty clean and robust, but it requires licences.
If I were you, I would try to take a step back and simplify the problem a bit. If the different workers can run in parallel, the following should apply:
If both of those requirements hold, you could use a single database to store the calculations. The only thing you have to take care of is that the transactions are carefully planned.
On the other hand, on a similar project we used a small trick to achieve this (same effect as the flashback solution): the insertion time was stored in the database too, and updates were actually inserts with new timestamps. All of the calculations were made against the correct records by adding to the query:
give me the last version of this kind of row before x timestamp
With this solution we avoided the licence costs and the snapshot maintenance. The only problem is that if you do not need the whole history, it will eat your database space pretty fast. To solve this, we ran a cron job that cleared out the unused records based on the timestamp.
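The trick above can be sketched end to end. This is an illustrative example, not the original project's schema: SQLite stands in for MySQL, the `prices` table and integer timestamps are invented for the demo. Every "update" is an insert with a new timestamp, readers pin themselves to a cutoff timestamp, and the cleanup deletes superseded versions while keeping the newest row per item below the cutoff:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE prices (
        item_id  INTEGER,
        value    REAL,
        inserted INTEGER   -- insertion time (epoch seconds, for simplicity)
    )
""")

# Three successive "updates" to item 1 become three inserted versions.
db.executemany("INSERT INTO prices VALUES (?, ?, ?)",
               [(1, 100.0, 1000), (1, 110.0, 2000), (1, 120.0, 3000)])

# "Give me the last version of this kind of row before timestamp X":
snapshot_ts = 2500
row = db.execute("""
    SELECT value FROM prices
    WHERE item_id = ? AND inserted <= ?
    ORDER BY inserted DESC
    LIMIT 1
""", (1, snapshot_ts)).fetchone()
print(row[0])  # 110.0 -- the version that was current at ts 2500

# Cron-style cleanup: delete a version older than the cutoff only if a
# newer version that is ALSO older than the cutoff exists, so queries
# at the cutoff timestamp keep working.
cutoff = 2500
db.execute("""
    DELETE FROM prices
    WHERE inserted < :cutoff
      AND EXISTS (SELECT 1 FROM prices p2
                  WHERE p2.item_id = prices.item_id
                    AND p2.inserted > prices.inserted
                    AND p2.inserted < :cutoff)
""", {"cutoff": cutoff})
remaining = db.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(remaining)  # 2 -- only the obsolete version at ts 1000 was removed
```

As long as every worker in one job uses the same cutoff timestamp, they all see the same snapshot, which is exactly the consistency the question asks for.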
If you want to go further, there is a technique called shadow tables. There is a nice MySQL blog post on this topic: http://arnab.org/blog/shadow-tables-using-mysql-triggers
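The shadow-table idea is that a trigger copies the old row into a history table before every update, so the main table stays small while history accumulates elsewhere. A minimal sketch of the mechanism (the linked post uses MySQL triggers; this demo uses SQLite trigger syntax, and the table names are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE prices (item_id INTEGER PRIMARY KEY, value REAL);
    CREATE TABLE prices_shadow (
        item_id    INTEGER,
        value      REAL,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Before an update overwrites a row, copy the old version aside.
    CREATE TRIGGER prices_history BEFORE UPDATE ON prices
    BEGIN
        INSERT INTO prices_shadow (item_id, value)
        VALUES (OLD.item_id, OLD.value);
    END;
""")

db.execute("INSERT INTO prices VALUES (1, 100.0)")
db.execute("UPDATE prices SET value = 200.0 WHERE item_id = 1")

live = db.execute("SELECT value FROM prices").fetchone()[0]
old = db.execute("SELECT value FROM prices_shadow").fetchone()[0]
print(live, old)  # 200.0 100.0 -- old version preserved automatically
```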