Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Synchronize two ordered lists

We have two offline systems that normally can not communicate with each other. Both systems maintain the same ordered list of items. Only rarely will they be able to communicate with each other to synchronize the list.

Items are marked with a modification timestamp to detect edits. Items are identified by UUIDs to avoid conflicts when inserting new items (as opposed to using auto-incrementing integers). When synchronizing new UUIDs are detected and copied to the other system. Likewise for deletions.

The above data structure is fine for an unordered list, but how can we handle ordering? If we added an integer "rank", that would need renumbering when inserting a new item (thus requiring synchronizing all successor items due to only 1 insertion). Alternatively, we could use fractional ranks (use the average of the ranks of the predecessor and successor item), but that doesn't seem like a robust solution as it will quickly run into accuracy problems when many new items are inserted.

We also considered implementing this as a doubly linked-list with each item holding the UUID of its predecessor and successor item. However, that would still require synchronizing 3 items when 1 new items was inserted (or synchronizing the 2 remaining items when 1 item was deleted).

Preferably, we would like to use a data structure or algorithm where only the newly inserted item needs to be synchronized. Does such a data structure exist?

Edit: we need to be able to handle moving an existing item to a different position too!

like image 671
Jason Smith Avatar asked Apr 12 '12 19:04

Jason Smith


People also ask

What are the two types of synchronization?

There are two types of synchronization: full and incremental.

What is data synchronization example?

Let's see a simple example of data synchronization: Suppose we have added a new popular ringtone to one of the servers of a mobile service provider, so here, data synchronization means that all the service provider servers get identical sets of ringtones.

What is data synchronization in IoT?

Data Synchronization (Sync) is the process of establishing consistency and consolidation of data between different devices. It is fundamental to most IT solutions, especially in IoT and Mobile. Data Sync entails the continuous harmonization of data over time and typically is a complex, non-trivial process.

What is synchronization in big data?

Data synchronization is the ongoing process of synchronizing data between two or more devices and updating changes automatically between them to maintain consistency within systems. While the sheer quantity of data afforded by the cloud presents challenges, it also provides the perfect solution for big data.


3 Answers

There is really no problem with the interpolated rank approach. Just define your own numbering system based on variable length bit vectors representing binary fractions between 0 and 1 with no trailing zeros. The binary point is to the left of the first digit.

The only inconvenience of this system is that the minimum possible key is 0 given by the empty bit vector. Therefore you use this only if you're positive the associated item will forever be the first list element. Normally, just give the first item the key 1. That's equivalent to 1/2, so random insertions in the range (0..1) will tend to minimize bit usage. To interpolate an item before and after,

01 < newly interpolated = 1/4
1
11 < newly interpolated = 3/4

To interpolate again:

001 < newly interpolated = 1/8
01
011 < newly interpolated = 3/8
1
101 < newly interpolated = 5/8
11 
111  < newly interpolated = 7/8

Note that if you wish you can omit storing the final 1! All keys (except 0 which you won't normally use) end in 1, so storing it is supefluous.

Comparison of binary fractions is a lot like lexical comparison: 0<1 and the first bit difference in a left-to-right scan tells you which is less. If no differences occur, i.e. one vector is a strict prefix of the other, then the shorter one is smaller.

With these rules it's pretty simple to come up with an algorithm that accepts two bit vectors and computes a result that's roughly (or exactly in some cases) between them. Just add the bit strings, and shift right 1, dropping unnecessary trailing bits, i.e. take the average of the two to split the range between.

In the example above, if deletions had left us with:

01
111

and we need to interpolate these, add 01(0) and and 111 to obtain 1.001, then shift to get 1001. This works fine as an interpolant. But note the final 1 unnecessarily makes it longer than either of the operands. An easy optimization is to drop the final 1 bit along with trailing zeros to get simply 1. Sure enough, 1 is about half way between as we'd hope.

Of course if you do many inserts in the same location (think e.g. of successive inserts at the start of the list), the bit vectors will get long. This is exactly the same phenomenon as inserting at the same point in a binary tree. It grows long and stringy. To fix this, you must "rebalance" during a synchronization by renumbering with the shortest possible bit vectors, e.g. for 14 you'd use the sequence above.

Addition

Though I haven't tried it, the Postgres bit string type seems to suffice for the keys I've described. The thing I'd need to verify is that the collation order is correct.

Also, the same reasoning works just fine with base-k digits for any k>=2. The first item gets key k/2. There is also a simple optimization that prevents the very common cases of appending and prepending elements at the end and front respectively from causing keys of length O(n). It maintains O(log n) for those cases (though inserting at the same place internally can still produce O(p) keys after p insertions). I'll let you work that out. With k=256, you can use indefinite length byte strings. In SQL, I believe you'd want varbinary(max). SQL provides the correct lexicographic sort order. Implementation of the interpolation ops is easy if you have a BigInteger package similar to Java's. If you like human-readable data, you can convert the byte strings to e.g. hex strings (0-9a-f) and store those. Then normal UTF8 string sort order is correct.

like image 177
Gene Avatar answered Oct 20 '22 00:10

Gene


You can add two fields to each item - 'creation timestamp' and 'inserted after' (containing the id of the item after which the new item was inserted). Once you synchronize a list, send all the new items. That information is enough for you to be able to construct the list on the other side.

With the list of newly added items received, do this (on the receiving end): sort by creation timestamp, then go one by one, and use the 'inserted after' field to add the new item in the appropriate place.

You may face trouble if an item A is added, then B is added after A, then A is removed. If this can happen, you will need to sync A as well (basically syncing the operations that took place on the list since the last sync, and not just the content of the current list). It's basically a form of log-shipping.

like image 30
zmbq Avatar answered Oct 19 '22 22:10

zmbq


You could have a look at "lenses", which is bidirectional programming concept. For instance, your problem seems to be solved my "matching lenses", described in this paper.

like image 34
esope Avatar answered Oct 19 '22 22:10

esope