Problem
I have two lists of objects. Each object contains the following:

- GUID (allows to determine whether two objects are the same, from the business point of view)
- Timestamp (updated to the current UTC time each time the object changes)
- Version (positive integer; incremented each time the object changes)
- Deleted (boolean flag; switched to "true" instead of actually deleting the object)
- Data (some useful payload)

Next, I need to sync the two lists according to these rules:
- if a GUID is present in only one list, the object should be copied to the other list
- if a GUID is present in both lists, the instance with the lesser Version should be replaced with the one having the greater Version (nothing to do if the versions are equal)

Real-world requirements:
Solution #1 (currently implemented)
- both sides remember the time of the last successful sync (T)
- items with Timestamp > T (i.e. recently modified; in production it's ... > (T - day) for better robustness) are exchanged
- Versions are compared, and the newer instance is saved to the appropriate list (if needed)

Pros:

Cons:

- everything depends on T, which makes the algorithm fragile: it's easy to sync the latest updates, but hard to make sure the lists are completely synced (using a minimal T like 1970-01-01 just hangs the sync process)

My questions:
P.S. Already viewed, not duplicates:
All answers have some worthy points. To summarize, here is the compiled answer I was looking for, based on the finally implemented, working sync system:
In general, use Merkle trees. They are dramatically more efficient at comparing large amounts of data.
If you can, rebuild your hash tree from scratch every time you need it. Check the time required to rebuild the hash tree; most likely it's pretty fast (e.g., in my case, on a Nexus 4 rebuilding the tree for 20k items takes ~2 sec: 1.8 sec for fetching the data from the DB + 0.2 sec for building the tree; the server performs ~20x faster), so you don't need to store the tree in the DB and maintain it when the data changes (my first try was rebuilding only the relevant branches; it's not too complicated to implement, but it's very fragile).
Nevertheless, it's OK to cache and reuse the tree if no data modifications were made at all. Once a modification happens, invalidate the whole cache.
The tree is built over hex GUID prefixes:

- 0000 → (hash of items with GUID 0000*)
- 0001 → (hash of items with GUID 0001*)
- ...
- ffff → (hash of items with GUID ffff*)
- 000 → (hash of hashes 000_)
- ...
- 00 → (hash of hashes 00_)
- ...and so on, up to the root hash (hash of hashes _)

Thus, the tree has 65536 leaves and requires 2 MB of memory; each leaf covers ~N/65536 data items. A binary tree would be 2x more efficient in terms of memory, but it's harder to implement.
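A hash tree of this shape can be rebuilt from scratch along these lines. This is an illustrative sketch: MD5 and the guid:version leaf payload are my choices, not necessarily the author's:

```python
import hashlib

def build_tree(items):
    """Build a 16-ary hash tree over items keyed by hex GUID.

    Leaves are 4-hex-char GUID prefixes; each inner node hashes the
    concatenation of its child hashes. Returns {prefix: hash}, where
    "" maps to the root hash. Only prefixes with data are stored.
    """
    tree = {}
    # Leaf level: hash all items sharing a 4-char GUID prefix.
    buckets = {}
    for item in sorted(items, key=lambda i: i["guid"]):
        buckets.setdefault(item["guid"][:4], []).append(item)
    for prefix, bucket in buckets.items():
        payload = "".join(f"{i['guid']}:{i['version']}" for i in bucket)
        tree[prefix] = hashlib.md5(payload.encode()).hexdigest()
    # Inner levels: 3-char prefixes, then 2, 1, and finally the root ("").
    for length in (3, 2, 1, 0):
        level = {}
        for prefix, digest in sorted(tree.items()):
            if len(prefix) == length + 1:
                level.setdefault(prefix[:length], []).append(digest)
        for prefix, digests in level.items():
            tree[prefix] = hashlib.md5("".join(digests).encode()).hexdigest()
    return tree
```

Because items and child hashes are processed in sorted order, the tree is deterministic: two sides holding the same data always compute the same root hash.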
I had to implement these methods:
- getHash(): returns the root hash; used for the primary check (as mentioned, in 96% of cases that's all we need to test);
- getHashChildren(x): returns the list of child hashes x_ (at most 16); used for efficient, single-request discovery of data differences;
- findByIdPrefix(x): returns the items with GUID x* (x must contain exactly 4 chars); used for requesting leaf items;
- count(x): returns the number of items with GUID x*; when it's reasonably small, we can skip checking the tree branch by branch and transfer the whole bunch of items with a single request.

Since syncing is done per branch, transmitting small amounts of data, it's very responsive (you can check the progress at any time), very robust against unexpected termination (e.g., due to network failure), and easily restarts from the last point if needed.
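Given the four-method interface above, discovering the differing leaves is a short recursive descent. The store below is an illustrative in-memory stand-in, not the author's implementation; the hashing details (MD5, guid:version payloads, a constant hash for empty branches) are assumptions:

```python
import hashlib

class HashTreeStore:
    """Minimal in-memory stand-in exposing the four methods from the text."""

    def __init__(self, items):
        self.items = {i["guid"]: i for i in items}
        self._cache = {}  # per the text: reuse until data changes, then drop wholesale

    def _hash(self, prefix):
        if prefix not in self._cache:
            guids = sorted(g for g in self.items if g.startswith(prefix))
            if not guids:
                payload = ""                       # empty branch: constant hash
            elif len(prefix) == 4:                 # leaf: hash the items themselves
                payload = "".join(f"{g}:{self.items[g]['version']}" for g in guids)
            else:                                  # inner node: hash of child hashes
                payload = "".join(self._hash(prefix + c) for c in "0123456789abcdef")
            self._cache[prefix] = hashlib.md5(payload.encode()).hexdigest()
        return self._cache[prefix]

    def getHash(self):
        return self._hash("")

    def getHashChildren(self, x):
        return {x + c: self._hash(x + c) for c in "0123456789abcdef"}

    def findByIdPrefix(self, x):
        assert len(x) == 4
        return [i for g, i in self.items.items() if g.startswith(x)]

    def count(self, x):
        return sum(1 for g in self.items if g.startswith(x))


def diff_prefixes(local, remote, prefix=""):
    """Walk both trees and return the 4-char leaf prefixes whose hashes
    differ; their items can then be fetched via findByIdPrefix."""
    if prefix == "" and local.getHash() == remote.getHash():
        return []                                  # the fast path covering 96% of checks
    if len(prefix) == 4:
        return [prefix]
    lch = local.getHashChildren(prefix)
    rch = remote.getHashChildren(prefix)
    diffs = []
    for child in sorted(lch):
        if lch[child] != rch[child]:
            diffs.extend(diff_prefixes(local, remote, child))
    return diffs
```

Each level of the descent costs one getHashChildren request per side, so locating a single changed item among N takes only a handful of small round trips.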
Also:

- watch out for the case {version_1 = version_2, but hash_1 != hash_2}: here you must make some decision (maybe with the user's help, or by comparing timestamps as a last resort) and rewrite one item with the other to resolve the conflict; otherwise you'll end up with unsynced and unsyncable hash trees;
- transfer (GUID, Version) pairs without the payload to keep requests lightweight.
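A minimal sketch of that timestamp tie-break; bumping the winner's Version afterwards is my assumption, added so that the rewritten item then propagates through the normal greater-Version rule:

```python
def resolve_conflict(item_a, item_b):
    """Resolve {version_1 == version_2, hash_1 != hash_2} by comparing
    timestamps as a last resort; the newer item wins."""
    winner = item_a if item_a["timestamp"] >= item_b["timestamp"] else item_b
    # Bump the Version so both sides converge on the same (Version, hash) pair.
    return dict(winner, version=winner["version"] + 1)
```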
pairs without payload for lightweighting requests.Two suggestions come to mind, the first one is possibly something you're doing already:
1) Don't send entire lists of items with timestamps > T. Instead, send a list of (UUID, Version) tuples of objects with timestamps > T. Then the other side can figure out which objects it needs to update from that, and send back the UUIDs of those to request the actual objects. This avoids sending full objects that have timestamp > T but are nonetheless already newer (or already present with the latest Version) on the other side.
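The receiving side's bookkeeping for suggestion 1 might look like this (names are illustrative):

```python
def needed_updates(remote_tuples, local_items):
    """Given (UUID, Version) tuples sent by the other side, return the
    UUIDs whose full objects should be requested: those missing locally
    or present with a smaller Version."""
    local_versions = {i["guid"]: i["version"] for i in local_items}
    return [uuid for uuid, version in remote_tuples
            if local_versions.get(uuid, -1) < version]
```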
2) Don't process the full list at once, but in chunks, i.e. first sync 10%, then the next 10%, etc., to avoid transferring too much data at once for big syncs (and to allow for restarting points if a connection should break). This can be done by e.g. starting with all UUIDs with a checksum equivalent to 1 modulo 10, then 2 modulo 10, etc.
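Suggestion 2 might look like the sketch below; CRC32 as the checksum is my assumption:

```python
import zlib

def chunk_of(uuid, n_chunks=10):
    # Deterministic chunk assignment via a checksum of the UUID.
    return zlib.crc32(uuid.encode()) % n_chunks

def sync_in_chunks(all_uuids, sync_chunk, n_chunks=10):
    """Run the sync in n_chunks passes; if the connection breaks,
    restart from the last finished chunk instead of from scratch."""
    for k in range(n_chunks):
        sync_chunk([u for u in all_uuids if chunk_of(u, n_chunks) == k])
```

Because the checksum is deterministic, both sides agree on which UUIDs belong to chunk k without any coordination.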
Another possibility would be proactive syncing, e.g. asynchronously posting changes, possibly via UDP (unreliable, as opposed to TCP). You would still need to sync when you need current information, but chances are most of it will already be current.
You need to store not the time of the last synchronization, but the state of the objects (e.g., a hash of each object's data) at the time of the last synchronization. Then you compare each list with the stored list and find which objects have changed on each side.
This is much more reliable than relying on time, because time requires that both sides have synchronized clocks giving precise time (and this is not the case on most systems). For the same reason, your idea of detecting changes based on time + version can be more error-prone than it initially seems.
Also, you don't initially transfer the object data, but only the GUIDs.
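A sketch of this stored-state comparison; the per-object hashing scheme is my assumption:

```python
import hashlib

def changed_since_last_sync(current_items, stored_hashes):
    """Compare current per-object data hashes against those stored at the
    last sync; return GUIDs that were added or modified, plus GUIDs that
    were removed locally."""
    current = {i["guid"]: hashlib.md5(repr(i["data"]).encode()).hexdigest()
               for i in current_items}
    changed = [g for g in current if current[g] != stored_hashes.get(g)]
    removed = [g for g in stored_hashes if g not in current]
    return changed + removed
```

After a successful sync, the `current` mapping becomes the new stored state, so no shared clock is ever needed.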
BTW we've made a framework (free with source) which addresses exactly your problems. I am not giving the link because some alternatively talented people would complain.