 

Synchronize two lists of objects

Problem

I have two lists of objects. Each object contains the following:

  • GUID (identifies the object: two entries with the same GUID are the same object from a business point of view)
  • Timestamp (set to the current UTC time each time the object changes)
  • Version (positive integer; incremented each time the object changes)
  • Deleted (boolean flag; set to "true" instead of actually deleting the object)
  • Data (some useful payload)
  • Any other fields if needed

Next, I need to sync two lists according to these rules:

  • If an object with some GUID is present in only one list, it should be copied to the other list
  • If an object with some GUID is present in both lists, the instance with the lower Version should be replaced with the one having the greater Version (nothing to do if the versions are equal)
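The two rules above can be sketched as a plain dictionary merge. This is an illustrative sketch only, assuming each object is a dict with "guid" and "version" keys (field names chosen for illustration):

```python
# Hypothetical sketch of the two sync rules: union of both lists,
# keeping the higher-Version instance when a GUID appears in both.

def merge(list_a, list_b):
    """Return the synchronized contents both lists should end up with."""
    a = {o["guid"]: o for o in list_a}
    b = {o["guid"]: o for o in list_b}
    merged = {}
    for guid in a.keys() | b.keys():
        if guid not in a:
            merged[guid] = b[guid]   # present only in B -> copy to A
        elif guid not in b:
            merged[guid] = a[guid]   # present only in A -> copy to B
        else:
            # present in both: keep the instance with the greater Version
            merged[guid] = a[guid] if a[guid]["version"] >= b[guid]["version"] else b[guid]
    return list(merged.values())
```

Of course, doing this naively over the wire would transfer both full lists, which the requirements below rule out; the point is only to pin down the target state.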

Real-world requirements:

  • Each list has 50k+ objects, each object is about 1 KB
  • The lists live on different machines connected via the Internet (e.g., a mobile app and a remote server), so the algorithm shouldn't waste much traffic or CPU
  • Most of the time (say, 96%) the lists are already in sync before the sync process starts, so the algorithm should detect this with minimal effort
  • When there are differences, they are usually small (3-5 objects changed/added)
  • It should work fine if one list is empty (and the other still has 50k+ items)

Solution #1 (currently implemented)

  1. The client stores the time of the last successful sync (say, T)
  2. Both lists are asked for all objects having Timestamp > T (i.e., recently modified; in production it's ... > (T - day) for better robustness)
  3. These lists of recently modified objects are synced naively:
    • items present only in the first list are saved to the second list
    • items present only in the second list are saved to the first list
    • the remaining items have their Versions compared and are saved to the appropriate list (if needed)

Pros:

  • Works great with small changes
  • Almost fits the requirements

Cons:

  • Depends on T, which makes the algorithm fragile: it's easy to sync the latest updates, but hard to guarantee the lists are completely synced (using a minimal T like 1970-01-01 just hangs the sync process)

My questions:

  1. Is there any common / best-practice / proven way to sync lists of objects?
  2. Are there any better solutions than #1 for my case?

P.S. Already viewed, not duplicates:

  • Compare Two List Of Objects For Synchronization
  • Two list synchronization
Nikita Bosik asked Oct 27 '14


3 Answers

Summary

All answers have some worthy points. To summarize, here is the compiled answer I was looking for, based on the sync system I finally implemented:

  1. In general, use Merkle trees. They are dramatically efficient at comparing large amounts of data.

  2. If you can, rebuild your hash tree from scratch every time you need it. Check the time required to rebuild the hash tree. Most likely it's pretty fast (e.g., in my case on a Nexus 4, rebuilding the tree for 20k items takes ~2 sec: 1.8 sec for fetching the data from the DB + 0.2 sec for building the tree; the server performs ~20x faster), so you don't need to store the tree in the DB and maintain it when data changes (my first try was rebuilding only the relevant branches; it's not too complicated to implement, but it is very fragile).

  3. Nevertheless, it's OK to cache and reuse the tree if no data modifications were made at all. Once a modification happens, invalidate the whole cache.

Technical details

  • GUID is 32 chars long, lowercase, without any hyphens/braces;
  • I use a 16-ary tree of height 4, where each branch corresponds to a char of the GUID. It may be implemented as an actual tree or as a map:
    • 0000 → (hash of items with GUID 0000*)
    • 0001 → (hash of items with GUID 0001*)
    • ...
    • ffff → (hash of items with GUID ffff*);
    • 000 → (hash of hashes 000_)
    • ...
    • 00 → (hash of hashes 00_)
    • ...
    • () → (root hash, i.e. hash of hashes _)

Thus, the tree has 65536 leaves and requires 2 MB of memory; each leaf covers ~N/65536 data items. A binary tree would be 2x more efficient in terms of memory, but it's harder to implement.
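The prefix-map layout above can be sketched as follows. This is a minimal illustration, not the exact production code: it assumes GUIDs are 32 lowercase hex chars and that each item's state is represented by some bytes (its serialized data or a precomputed per-item hash):

```python
# Sketch of the 16-ary prefix hash map: 4-char prefixes are leaves,
# shorter prefixes hash their (up to 16) child hashes, "" is the root.

import hashlib

def build_hash_tree(items):
    """items: dict {guid: item_state_bytes}. Returns {prefix: hash_hex}."""
    tree = {}
    # Leaves: hash of concatenated item states sharing a 4-char GUID prefix
    # (sorted by GUID so the result is deterministic).
    leaves = {}
    for guid, state in sorted(items.items()):
        leaves.setdefault(guid[:4], []).append(state)
    for prefix, states in leaves.items():
        tree[prefix] = hashlib.sha256(b"".join(states)).hexdigest()
    # Interior levels: hash of the child hashes, from depth 3 up to the root.
    for depth in (3, 2, 1, 0):
        level = {}
        for prefix, h in sorted(tree.items()):
            if len(prefix) == depth + 1:
                level.setdefault(prefix[:depth], []).append(h)
        for prefix, child_hashes in level.items():
            tree[prefix] = hashlib.sha256("".join(child_hashes).encode()).hexdigest()
    return tree
```

Note that only prefixes with at least one item get an entry, so empty branches cost nothing; the full 65536-leaf footprint is the worst case.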

  • I had to implement these methods:

    • getHash() — returns the root hash; used for the primary check (as mentioned, in 96% of cases that's all we need);
    • getHashChildren(x) — returns the list of hashes x_ (at most 16); used for efficient, single-request discovery of data differences;
    • findByIdPrefix(x) — returns the items with GUID x*; x must contain exactly 4 chars; used for requesting leaf items;
    • count(x) — returns the number of items with GUID x*; when it's reasonably small, we can skip checking the branch level by level and transfer the whole bunch of items in a single request;
  • Since syncing is done per branch, transmitting small amounts of data, it's very responsive (you can check the progress at any time) and very robust against unexpected termination (e.g., due to network failure), and it easily restarts from the last point if needed.

  • IMPORTANT: sometimes you will get stuck in a conflicting state: {version_1 = version_2, but hash_1 != hash_2}. In this case you must make a decision (maybe with the user's help, or by comparing timestamps as a last resort) and overwrite one item with the other to resolve the conflict; otherwise you'll end up with unsynced and unsyncable hash trees.
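The branch-by-branch descent driven by getHash()/getHashChildren() can be sketched like this. A hedged illustration, assuming each side's tree is available as a prefix-to-hash map (in reality getHashChildren(x) would be a network request to the peer):

```python
# Sketch of difference discovery: walk both prefix->hash maps from the
# root and collect the 4-char leaf prefixes whose contents differ;
# those are the only branches that need item-level syncing.

def diff_prefixes(tree_a, tree_b, prefix=""):
    if tree_a.get(prefix) == tree_b.get(prefix):
        return []           # hashes match: this branch is already synced
    if len(prefix) == 4:
        return [prefix]     # differing leaf: fetch items via findByIdPrefix
    out = []
    for digit in "0123456789abcdef":   # descend into the 16 children
        out.extend(diff_prefixes(tree_a, tree_b, prefix + digit))
    return out
```

If 96% of the time the root hashes match, the very first comparison ends the whole sync in one request; otherwise each level of descent narrows the difference 16x.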

Possible improvements

  • Implement transmitting (GUID, Version) pairs without the payload to lighten the requests.
Nikita Bosik answered Nov 12 '22


Two suggestions come to mind, the first one is possibly something you're doing already:

1) Don't send entire lists of items with timestamps > T. Instead, send a list of (UUID, Version) tuples of objects with timestamps > T. The other side can then figure out which objects it needs to update and send back the UUIDs of those to request the actual objects. This avoids sending full objects that have timestamp > T but are already present on the other side with the same or a newer Version.
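Suggestion 1 boils down to a small decision function on the receiving side. An illustrative sketch (names are hypothetical, not from the question):

```python
# Given (UUID, Version) pairs received from the peer and our local
# UUID -> Version map, pick the UUIDs whose full objects we should
# request: unknown UUIDs, or known ones where the peer is newer.

def uuids_to_request(remote_pairs, local_versions):
    """remote_pairs: iterable of (uuid, version); local_versions: {uuid: version}."""
    return [u for u, v in remote_pairs
            if u not in local_versions or local_versions[u] < v]
```

Only the selected UUIDs go back over the wire, so unchanged or already-current objects cost a few dozen bytes each instead of ~1 KB.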

2) Don't process the full list at once, but in chunks, i.e., first sync 10%, then the next 10%, etc., to avoid transferring too much data at once for big syncs (and to allow restart points if a connection breaks). This can be done by, e.g., starting with all UUIDs whose checksum is equivalent to 0 modulo 10, then 1 modulo 10, etc.
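The checksum-modulo chunking in suggestion 2 might look like this (a sketch; CRC32 is an arbitrary choice of checksum, any stable hash of the UUID works):

```python
# Partition UUIDs into 10 deterministic chunks by checksum modulo 10,
# so both sides agree on chunk membership without any coordination.

import zlib

def chunk_of(uuid, chunks=10):
    return zlib.crc32(uuid.encode()) % chunks

def uuids_in_chunk(uuids, chunk, chunks=10):
    return [u for u in uuids if chunk_of(u, chunks) == chunk]
```

Because the chunk assignment depends only on the UUID itself, a sync interrupted after chunk 3 can resume at chunk 4 on both sides without renegotiation.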

Another possibility would be proactive syncing, e.g., asynchronously pushing changes, possibly via UDP (unreliable, as opposed to TCP). You would still need to sync when you need current information, but chances are most of it is already current.

Lying Dog answered Nov 13 '22


You need to store not the time of the last synchronization, but the state of the objects (e.g., the hash of each object's data) at the time of the last synchronization. Then you compare each list with the stored list and find which objects have changed on each side.

This is much more reliable than relying on time, because time requires both sides to have synchronized clocks giving precise time (and this is not the case on most systems). For the same reason, your idea of detecting changes based on time + version can be more error-prone than it initially seems.

Also, you don't transfer object data initially, only GUIDs.
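This answer's idea can be sketched as a snapshot diff. A hedged illustration under assumed names, where the snapshot is a GUID-to-hash map saved after the last successful sync:

```python
# Diff the current objects against per-object hashes snapshotted at the
# last sync, yielding locally changed/added GUIDs and deleted GUIDs,
# with no clock comparison involved.

import hashlib

def object_hash(data):
    return hashlib.sha256(data).hexdigest()

def local_changes(current, snapshot):
    """current: {guid: payload_bytes}; snapshot: {guid: hash} from last sync."""
    changed = [g for g, d in current.items()
               if snapshot.get(g) != object_hash(d)]
    removed = [g for g in snapshot if g not in current]
    return changed, removed
```

Each side computes its own change set against its own snapshot, then only the GUIDs (and later the requested payloads) cross the network.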

BTW we've made a framework (free with source) which addresses exactly your problems. I am not giving the link because some alternatively talented people would complain.

Eugene Mayevski 'Callback answered Nov 13 '22