How can I determine the difference between two large datasets?

Tags:

large-data-volumes

I have large datasets with millions of records in XML format. These datasets are full data dumps of a database up to a certain point in time.

Between two dumps new entries might have been added and existing ones might have been modified or deleted. Assume the schema remains unchanged and that every entry has a unique ID.

What would be the best way to determine the delta between two of these datasets (including deletions and updates)?

My plan is to load everything into an RDBMS and go from there.

First, load the older dump. Then load the newer dump into a different schema, checking as I go whether each entry is new or is an update to an existing entry. If it is, I'll log its ID in a new table called "changes".

Once that is done, I'll go through all entries in the old dump and check whether each has a matching record (i.e., the same ID) in the new dump. If not, I'll log it to "changes" as a deletion.
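A minimal sketch of this plan, using Python's built-in `sqlite3` module as a stand-in for the RDBMS. It is a set-based variant rather than the exact procedure above: both dumps are loaded first, and the database then finds the differences with joins instead of checking each entry during the load. The table names `old_dump` and `new_dump`, the `modified` column, and the function name `diff_dumps` are assumptions made for illustration; the real schema will differ.

```python
import sqlite3


def diff_dumps(old_rows, new_rows):
    """Return (added, updated, deleted) ID lists for two dumps.

    Each row is an (id, last_modified) tuple. This layout is assumed
    for the sketch; a real dump would carry more columns.
    """
    conn = sqlite3.connect(":memory:")
    # The primary key gives an indexed lookup by ID on both tables.
    conn.execute("CREATE TABLE old_dump (id INTEGER PRIMARY KEY, modified TEXT)")
    conn.execute("CREATE TABLE new_dump (id INTEGER PRIMARY KEY, modified TEXT)")
    conn.executemany("INSERT INTO old_dump VALUES (?, ?)", old_rows)
    conn.executemany("INSERT INTO new_dump VALUES (?, ?)", new_rows)

    # In the new dump but not in the old one: additions.
    added = [r[0] for r in conn.execute(
        "SELECT n.id FROM new_dump n LEFT JOIN old_dump o ON o.id = n.id "
        "WHERE o.id IS NULL ORDER BY n.id")]
    # In both dumps, but with a different modification date: updates.
    updated = [r[0] for r in conn.execute(
        "SELECT n.id FROM new_dump n JOIN old_dump o ON o.id = n.id "
        "WHERE n.modified <> o.modified ORDER BY n.id")]
    # In the old dump but not in the new one: deletions.
    deleted = [r[0] for r in conn.execute(
        "SELECT o.id FROM old_dump o LEFT JOIN new_dump n ON n.id = o.id "
        "WHERE n.id IS NULL ORDER BY o.id")]
    conn.close()
    return added, updated, deleted
```

Calling `diff_dumps(old_rows, new_rows)` with the `(id, last_modified)` pairs extracted from each dump returns three lists of IDs, which could be written to the "changes" table instead of being returned.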

Assuming that looking up a record by ID is an O(log n) operation, this should allow me to do everything in O(n log n) time.

Because I can determine the difference from the presence or absence of records, using just the ID and the last modification date, I could instead load everything into main memory. The time complexity would be the same, but with far less disk I/O, which should make this faster by orders of magnitude.
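The in-memory idea could look roughly like the sketch below, which keeps only the ID and last modification date of each entry. It assumes a hypothetical dump layout in which every record is an `<entry>` element with `id` and `modified` attributes; the element name, the attribute names, and the function names `load_index` and `diff_indexes` are illustrative only and would need adapting to the real format.

```python
import xml.etree.ElementTree as ET


def load_index(source):
    """Stream an XML dump and return a dict of {id: last_modified}.

    `source` is a file name or a binary file object. The <entry>
    element and its `id` / `modified` attributes are assumed names.
    """
    index = {}
    # iterparse reads the file incrementally instead of building the
    # whole tree up front.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "entry":
            index[elem.get("id")] = elem.get("modified")
            # Drop the element's content once read; the emptied element
            # itself stays attached to its parent.
            elem.clear()
    return index


def diff_indexes(old, new):
    """Return sorted (added, updated, deleted) ID lists."""
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    # Present in both dumps but with a different modification date.
    updated = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, updated, deleted
```

With both indexes built by `load_index`, `diff_indexes(old, new)` returns the added, updated, and deleted IDs. Since a Python dict is a hash table, each lookup takes constant time on average, so the comparison step itself is expected to be linear in the number of records.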

Suggestions? (Note: this is more of a performance question than anything else.)

asked Sep 06 '11 17:09

NullUserException

1 Answer

RedGate's SQL Data Compare

answered Sep 21 '22 15:09

adamcodes

