Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best Practice for synchronizing common distributed data

I have a internet application that supports offline mode where users might create data that will be synchronized with the server when the user comes back online. So because of this I'm using UUID's for identity in my database so the disconnected clients can generate new objects without fear of using an ID used by another client, etc. However, while this works great for objects that are owned by this user there are objects that are shared by multiple users. For example, tags used by a user might be global, and there's no possible way the remote database could hold all possible tags in the universe.

If an offline user creates an object and adds some tags to it. Let's say those tags don't exist on the user's local database so the software generates a UUID for them. Now when those tags are synchronized there would need to be resolution process to resolve any overlap. Some way to match up any existing tags in the remote database with the local versions.

One way is to use some process by which global objects are resolved by a natural key (name in the case of a tag), and the local database has to replace it's existing object with this the one from the global database. This can be messy when there are many connections to other objects. Something tells me to avoid this.

Another way to handle this is to use two IDs. One global ID and one local ID. I was hoping using UUIDs would help avoid this, but I keep going back and forth between using a single UUID and using two split IDs. Using this option makes me wonder if I've let the problem get out of hand.

Another approach is to track all changes through the non-shared objects. In this example, the object the user assigned the tags. When the user synchronizes their offline changes the server might replace his local tag with the global one. The next time this client synchronizes with the server it detects a change in the non-shared object. When the client pulls down that object he'll receive the global tag. The software will simply resave the non-shared object pointing it to the server's tag and orphaning his local version. Some issues with this are extra round trips to fully synchronize, and extra data in the local database that is just orphaned. Are there other issues or bugs that could happen when the system is in between synchronization states? (i.e. trying to talk to the server and sending it local UUIDs for objects, etc).

Another alternative is to avoid common objects. In my software that could be an acceptable answer. I'm not doing a lot of sharing of objects across users, but that doesn't mean I'd NOT be doing it in the future. Which means choosing this option could paralyze my software in the future should I need to add these types of features. There are consequences to this choice, and I'm not sure if I've completely explored them.

So I'm looking for any sort of best practice, existing algorithms for handling this type of system, guidance on choices, etc.

like image 398
chubbsondubs Avatar asked Aug 12 '09 12:08

chubbsondubs


3 Answers

Depend on what application semantics you want to offer to users, you may pick different solutions. E.g., if you are actually talking about tagging objects created by an offline user with a keyword, and wanting to share the tags across multiple objects created by different users, then using "text" for the tag is fine, as you suggested. Once everyone's changes are merged, tags with the same "text", like, say "THIS IS AWESOME", will be shared.

There are other ways to handle disconnected updates to shared objects. SVN, CVS, and other version control system try to resolve conflicts automatically, and when cannot, will just tell user there is a conflict. You can do the same, just tell user there have been concurrent updates and the users have to handle resolution.

Alternatively, you can also log updates as units of change, and try to compose the changes together. For example, if your shared object is a canvas, and your application semantics allows shared drawing on the same canvas, then a disconnected update that draws a line from point A to point B, and another disconnected update drawing a line from point C to point D, can be composed. In this case, if you keep those two updates as just two operations, you can order the two updates and on re-connection, each user uploads all its disconnected operations and applies missing operations from other users. You probably want some kind of ordering rule, perhaps based on version number.

Another alternative: if updates to shared objects cannot be automatically reconciled, and your application semantics does not support notifying user and asking user to resolve conflicts due to disconnected updates, then you can also use version tree to handle this. Each update to a shared object creates a new version, with past version as the parent. When there are disconnected updates to a shared object from two different users, two separate children versions/leaf nodes result from the same parent version. If your application's internal representation of state is this version tree, then your application's internal state remains consistent despite disconnected updates, and you can handle the two branches of the version tree in some other way (e.g. letting user know of branches and create tools for them to merge branches, as in source control systems).

Just a few options. Hope this helps.

like image 182
OverClocked Avatar answered Oct 05 '22 22:10

OverClocked


Your problem is quite similar to versioning systems like SVN. You could take example from those.

Each user would have a set of personal objects plus any shared objects that they need. Locally, they will work as if they own the all the objects.

During sync, the client would first download any changes in the objects, and automatically synchronize what is obvious. In your example, if there is a new tag coming from the server with the same name, then it would update the UUID correspondingly on the local system.

This would also be a nice place in which to detect and handle cases like data committed from another client, but by the same user.

Once the client has an updated and merged version of the data, you can do an upload.

There will be to round trips, but I see no way of doing this without overcomplicating the data structure and having potential pitfalls in the way you do the sync.

like image 31
Sklivvz Avatar answered Oct 06 '22 00:10

Sklivvz


As a totally out of left-field suggestion, I'm wondering if using something like CouchDB might work for your situation. Its replication features could handle a lot of your online/offline synchronisation problems for you, including mechanisms to allow the application to handle conflict resolution when it arises.

like image 23
Evan Avatar answered Oct 05 '22 22:10

Evan