There is frequently the need to synchronize data from master tables in one database to clone tables in other databases, often on other servers. For example, consider the case where a backend system manages inventory data and that inventory data ultimately must be pushed to one or more databases that are part of a web site application.
The source data in the backend system is heavily normalized, with dozens of tables and foreign key constraints. It is a well-designed OLTP RDBMS system. Many of the tables in question contain millions of rows. The need is to push this data out to the other databases regularly. As frequently as feasible; latency can be tolerated. Above all, maximum uptime of both the backend and remote databases is imperative.
I am using SQL Server and am familiar with change tracking, rowversion, triggers, and so on. I know that Microsoft pushes replication, SyncFx, and SSIS heavily for these scenarios. However, there is quite a difference between vendor whitepapers and overviews recommending technologies and the actual implementation, deployment, and maintenance of the solution. In the SQL Server world, replication is often viewed as the turnkey solution, but I am trying to explore alternate solutions. (There is some fear that replication is difficult to administer, makes it hard to change schema, and in the event that a re-initialize is ever required there would be large downtime for critical systems.)
There are lots of gotchas. Due to the complex foreign key relationships among large numbers of tables, determining what order to perform captures or to apply updates is not trivial. Due to unique indexes, two rows might be interlocked in such a way that row-at-a-time update will not even work (need to perform intermediate updates to each row before the final update). These are not necessarily show-stoppers, as unique indexes can often be changed to regular indexes and foreign keys can be disabled (though disabling the foreign keys is extremely undesirable). Often, you will hear, "just" use SQL 2008 change tracking and SSIS or SyncFx. These kinds of answers really do not do justice to the practical difficulties. (And of course, clients really have a hard time wrapping their heads over how copying data could be so difficult, making a difficult situation all the worse!)
This issue is ultimately very generic: perform one-way synchronization of many heavily related database tables with lots of rows. Almost everyone involved in databases has to deal with this kind of issue. Whitepapers are common, practical expertise hard to find. We know this can be a difficult issue, but the job must get done. Let's hear about what has worked for you (and what to avoid). Tell your experience with Microsoft products or products from other vendors. But if you personally have not battle-tested the solution with large numbers of heavily-related tables and rows, please refrain from answering. Let's keep this practical -- not theoretical.
Better ask on serverfault.com (I can't post comments, scripts are broken in SO, so I have to post a full answer)
Update: (switched to Safari, scripts work again, I can post properly)
There is no silver bullet. For ease of use and 'one key turn' deployment nothing can beat replication. Is the only solution that covers deeply conflict detection and resolution, has support for pushing schema changes and comes with a comprehensive set of tools for setting it up and monitoring it. It has been the MS poster child of data synchronization for many years before this 'agenda' was taken over by the .Net crowd. Replication has two underlying problems in my opinion:
At the other end of the spectrum you'll find solutions that truly address the problem of application communication, like services based on queued messaging. But are either painfully slow and riddled with problems rooted in the separation of the communication mechanism (web services and or msmq) and the data storage (DTC transactions between comm and db, no common high availability story, no common recoverability story etc etc). Solutions that are blazingly fast and fully integrated with DB exists in the MS stack, but nobody knows how to use them. Somewhere in between these and replication you'll find various intermediate solutions, like OCS/Synch framework and SSIS based custom solutions. None will offer the ease of setup and monitoring of replication, but they might scale and perform better.
I was involved with several projects that required 'data synchronization' on a very large scale (+1200 sites, +1600 sites) and my solution was to turn the problem on a 'application communication' problem. Once the mindset is changed to this and the data flow is no longer seen as 'record with key X of table Y' but instead 'message communicating the purchase of item X by customer Y' the solution becomes easier to understand and apply. You no longer think in terms of 'insert records in order X-Y-Z so FK relations don't break' but instead in terms of 'process purchase as described by message XYZ'.
In my view replication, and it derivatives (ie. data tracking and data-gram shipping), are solutions anchored in the '80 technologies and view of the data/applications. Obsolete dinosaurs (and by no way turning into birds).
I know this does not even begin to address all your (very legit) concerns, but writing out all I have to say/rant/rable on this topic would fill volumes of paperback...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With