
What's the standard algorithm for syncing two lists of related objects?

I'm pretty sure this must be in some kind of textbook (or more likely in all of them), but I seem to be using the wrong keywords to search for it... :(

A recurring task I face while programming is dealing with lists of objects from different sources that I need to keep in sync somehow. Typically there's some sort of "master list", e.g. one returned by some external API, and a list of objects I create myself, each of which corresponds to an object in the master list (think "wrappers" or "adapters": they typically contain extended information about the external objects that is specific to my application, and/or they simplify access to the external objects).

Hard characteristics of all instances of the problem:

  • the implementation of the master list is hidden from me; its interface is fixed
  • the elements in the two lists are not assignment-compatible
  • I have full control over the implementation of the slave list
  • I cannot control the order of elements in the master list (i.e. it's not sortable)
  • the master list either does not provide notifications about added or removed elements at all, or the notifications are unreliable, i.e. the sync can only happen on demand, not live
  • simply clearing and rebuilding the slave list from scratch whenever it's needed is not an option:
    • initializing the wrapper objects should be considered expensive
    • other objects will hold references to the wrappers

Additional characteristics in some instances of the problem:

  • elements in the master list can only be identified by reading their properties rather than accessing them directly by index or memory address:
    • after a refresh, the master list might return a completely new set of instances even though they still represent the same information
    • the only interface for accessing elements in the master list might be a sequential enumerator
  • most of the time, the order of elements in the master list is stable, i.e. new elements are always added either at the beginning or at the end, never in the middle; however, deletion can usually occur at any position

So how would I typically tackle this? What's the name of the algorithm I should google for?

In the past I have implemented this in various ways (see below for an example) but it always felt like there should be a cleaner and more efficient way, especially one that did not require two iterations (one over each list).

Here's an example approach:

  1. Iterate over the master list
  2. Look up each item in the "slave list"
  3. Add items that do not yet exist
  4. Somehow keep track of items that already exist in both lists (e.g. by tagging them or keeping yet another list)
  5. When done, iterate over the slave list and remove all objects that have not been tagged (see 4.) and clear the tag again from all others
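The five steps above could be sketched roughly as follows. This is only an illustration under assumptions that are not part of the question: the slave list is kept as a dict, each master element yields a hashable identifying key through a hypothetical `key_of` function, and wrappers are built by a hypothetical `make_wrapper` factory. A set of seen keys stands in for the "tag" from step 4.

```python
from typing import Callable, Dict, Hashable, Iterable, TypeVar

M = TypeVar("M")  # element type of the master list
W = TypeVar("W")  # wrapper type held in the slave collection


def sync_wrappers(
    master: Iterable[M],
    wrappers: Dict[Hashable, W],
    key_of: Callable[[M], Hashable],     # hypothetical: reads identifying properties
    make_wrapper: Callable[[M], W],      # hypothetical: the expensive initialization
) -> None:
    """Update `wrappers` in place so it holds one wrapper per master element."""
    seen = set()
    for item in master:                  # step 1: iterate over the master list
        key = key_of(item)
        seen.add(key)                    # step 4: remember that this key still exists
        if key not in wrappers:          # step 2: look the item up in the slave list
            wrappers[key] = make_wrapper(item)  # step 3: add what is missing
    for key in list(wrappers):           # step 5: drop wrappers that were not seen
        if key not in seen:
            del wrappers[key]
```

Existing wrappers are never recreated, so references held elsewhere stay valid. If the master list hands out fresh instances after a refresh, a surviving wrapper would probably also need to be pointed at the new instance inside the first loop. The second pass is still there, but it only touches the slave collection.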

Update 1 Thanks for all your responses so far! I will need some time to look at the links.
[...] (text moved to main body of question)

Update 2 Restructured the middle paragraph into (hopefully) more easily parseable bullet lists and incorporated details added later in the first update.

Oliver Giesen asked Dec 18 '09 14:12


1 Answer

The two typical solutions are:

  1. Copy the master list to the sync list.
  2. Do an O(N*N) comparison between all element pairs.

You've excluded the smart options already: shared identity, sorting and change notifications.

Note that it's not relevant whether the lists can be sorted in a meaningful way, or even completely. For instance, when comparing two string lists, it would be ideal to sort alphabetically. But the list comparison would still be more efficient if you sorted both lists by string length! You'd still have to do a full pairwise comparison of the strings of the same length, but that will probably be a much smaller number of pairs.
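A minimal sketch of that idea, with one substitution stated plainly: instead of sorting by the partial key, it groups the old elements into buckets by that key, which restricts the pairwise comparison in the same way. The names `diff_by_bucket`, `bucket_of` and `same` are made up for this illustration.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def diff_by_bucket(
    old: Sequence[T],
    new: Sequence[T],
    bucket_of: Callable[[T], Hashable],  # coarse key, e.g. len for strings
    same: Callable[[T, T], bool],        # full (expensive) comparison
) -> Tuple[List[T], List[T]]:
    """Return (added, removed), comparing only elements that share a bucket."""
    buckets = defaultdict(list)
    for o in old:
        buckets[bucket_of(o)].append(o)

    added = []
    for n in new:
        candidates = buckets.get(bucket_of(n), [])
        for i, o in enumerate(candidates):
            if same(o, n):               # pairwise test, but only within one bucket
                del candidates[i]        # matched: it is neither added nor removed
                break
        else:
            added.append(n)              # no old element in this bucket matched

    # Whatever is left in the buckets was never matched by a new element.
    removed = [o for group in buckets.values() for o in group]
    return added, removed
```

For example, `diff_by_bucket(["apple", "pear", "fig"], ["pear", "kiwi", "fig", "plum"], len, lambda a, b: a == b)` only ever compares strings of equal length. The worst case remains O(N*N) if every element lands in the same bucket.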

MSalters answered Oct 15 '22 14:10