The two databases have identical schemas, but distinct data. It's possible there will be some duplication of rows, but it's sufficient for the merge to bail noisily and not do the update if duplicates are found, i.e., duplicates should be resolved manually.
Part of the problem is that there are a number of foreign key constraints in the databases in question. There are also some columns that logically reference keys in other tables but have no declared foreign key constraint; those constraints were omitted for insert-performance reasons. Finally, we need to be able to map between the IDs from the old databases and the IDs in the new database.
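One way to keep that last mapping would be a scratch table populated as rows are renumbered during the merge. A minimal sketch, with all names hypothetical:

```sql
-- Hypothetical mapping table: records how each row's id in a source
-- database maps to its id in the merged database.
CREATE TABLE id_map (
    source_db  text    NOT NULL,  -- which old database the row came from
    table_name text    NOT NULL,  -- which table the id belongs to
    old_id     integer NOT NULL,  -- id in the source database
    new_id     integer NOT NULL,  -- id after renumbering in the merged database
    PRIMARY KEY (source_db, table_name, old_id)
);
```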
Obviously, we can write a bunch of code to handle this, but we are hoping for a solution that requires less custom work than that.

We are still searching the web and the PostgreSQL documentation for an answer, but what we've found so far has been unhelpful.
Update: One thing I clearly left out is that "duplicates" are defined precisely by the unique constraints in the schema. We expect to restore the contents of one database, then restore the contents of a second. Errors during the second restore should be considered fatal to that restore. The duplicates would then be removed from the second database and a new dump created. We want the IDs to be renumbered, but not the values covered by the other unique constraints. It's possible, BTW, that there will be a third or even a fourth database to merge after the second.
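As a concrete sketch of that duplicate check, suppose the first database's data lives in schema public and the second dump has been restored into a schema named db2, with users.email carrying one of the defining unique constraints (all names here are hypothetical):

```sql
-- List rows in the second data set that collide with the first on a
-- unique constraint (here, users.email). Any hits must be resolved
-- manually before the merge, per the "bail noisily" requirement.
SELECT d2.id AS db2_id, d1.id AS db1_id, d1.email
FROM   db2.users d2
JOIN   public.users d1 ON d1.email = d2.email;
```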
There's no shortcut to writing a bunch of scripts… This cannot realistically be automated, since managing conflicts requires applying rules that will be specific to your data.
That said, you can reduce the odds of conflicts by removing duplicate surrogate keys…
Say your two databases have only two tables: A (id pkey) and B (id pkey, a_id references A(id)). In the first database, find max_a_id = max(A.id) and max_b_id = max(B.id).

In the second database, make sure the foreign key on a_id does cascade updates, then run update A set id = id + max_a_id, and the same kind of thing for B. Next, import this data into the first database, and update the sequences accordingly.
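A minimal sketch of that renumbering in SQL, to be run in the second database; the constraint and sequence names are hypothetical, and the literal offsets stand in for max_a_id and max_b_id as measured in the first database:

```sql
-- Recreate the FK so that updates to A.id cascade into B.a_id.
ALTER TABLE B DROP CONSTRAINT b_a_id_fkey;
ALTER TABLE B ADD CONSTRAINT b_a_id_fkey
    FOREIGN KEY (a_id) REFERENCES A(id) ON UPDATE CASCADE;

-- Shift the ids above the first database's range. Note: if the shifted
-- range overlaps ids still present in the table, the per-row primary key
-- check can fail mid-update; a DEFERRABLE primary key avoids that.
UPDATE A SET id = id + 1000000;  -- 1000000 stands in for max_a_id
UPDATE B SET id = id + 500000;   -- 500000 stands in for max_b_id

-- After importing into the first database, bump its sequences past the
-- new maximums (run there, not here).
SELECT setval('a_id_seq', (SELECT max(id) FROM A));
SELECT setval('b_id_seq', (SELECT max(id) FROM B));
```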
You'll still need to be wary of overflows if IDs can end up larger than the int4 maximum of 2,147,483,647 (about 2.1 billion), and of unique keys that might exist in both databases. But at least you won't need to worry about duplicate IDs.
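A quick way to see how much headroom an offset leaves below that limit (table name hypothetical, with 1000000 standing in for the offset):

```sql
SELECT max(id) + 1000000                AS highest_new_id,
       2147483647 - (max(id) + 1000000) AS headroom
FROM   A;
```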
This is the sort of case where I'd look into ETL tools like CloverETL, Pentaho Kettle, or Talend Studio.
I tend to agree with Denis that there aren't any real shortcuts to avoid dealing with the complexity of a data merge.