
Merging two distinct PostgreSQL databases into a single database

The two databases have identical schemas, but distinct data. It's possible there will be some duplication of rows, but it's sufficient for the merge to bail noisily and not do the update if duplicates are found, i.e., duplicates should be resolved manually.

Part of the problem is that the databases in question have a number of foreign key constraints. There may also be columns that reference keys in other tables without a declared foreign key constraint; the constraints were left off those columns because of performance issues on insertion. We also need to be able to map between the IDs from the old databases and the IDs in the new database.

Obviously, we can write a bunch of code to handle this, but we are looking for a solution which is:

  1. Less work.
  2. Less overhead on the machines doing the merge.
  3. More reliable. Any code we write will need to go through testing, etc., and isn't guaranteed to be bug-free.

Obviously, we are still searching the web and the PostgreSQL documentation for the answer, but what we've found so far has been unhelpful.

Update: One thing I left out is that "duplicates" are defined precisely by the unique constraints in the schema. We expect to restore the contents of one database, then restore the contents of a second. Any error during the second restore should be treated as fatal to that restore. The duplicates should then be removed from the second database and a new dump created. We want the IDs to be renumbered, but the values covered by the other unique constraints must stay unchanged. It's possible, BTW, that there will be a third or even a fourth database to merge after the second.
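To make the "bail noisily on duplicates" requirement concrete, here is a minimal Python sketch of a pre-merge check. It assumes the rows of one table from each database have already been loaded as dictionaries; the function names and the "email" column are hypothetical and only stand in for whatever columns the real unique constraints cover.

```python
def find_duplicates(first_rows, second_rows, unique_cols):
    """Return the sorted unique-key tuples that appear in both row sets."""
    def key(row):
        return tuple(row[c] for c in unique_cols)

    first_keys = {key(r) for r in first_rows}
    return sorted({key(r) for r in second_rows if key(r) in first_keys})


def check_mergeable(first_rows, second_rows, unique_cols):
    """Raise ValueError (and so abort the merge) if any unique key collides."""
    dups = find_duplicates(first_rows, second_rows, unique_cols)
    if dups:
        raise ValueError(
            f"{len(dups)} duplicate key(s) on {unique_cols}: {dups}"
        )


# Hypothetical sample data: "email" plays the role of a unique constraint.
db1_users = [{"id": 1, "email": "a@example.com"},
             {"id": 2, "email": "b@example.com"}]
db2_users = [{"id": 1, "email": "b@example.com"},
             {"id": 2, "email": "c@example.com"}]
```

Calling check_mergeable(db1_users, db2_users, ["email"]) raises here because b@example.com exists in both sets; the surrogate id column is deliberately ignored, since IDs are expected to be renumbered anyway.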

aikimcr asked Jan 06 '14 22:01



2 Answers

There's no shortcut to writing a bunch of scripts… This cannot realistically be automated, since managing conflicts requires applying rules that will be specific to your data.

That said, you can reduce the odds of conflicts by removing duplicate surrogate keys…

Say your two databases have only two tables: A (id pkey) and B (id pkey, a_id references A(id)). In the first database, find max_a_id = max(A.id) and max_b_id = max(B.id).

In the second database:

  1. Alter table B if needed so that the foreign key on a_id is declared with ON UPDATE CASCADE.
  2. Disable triggers if any have side effects that might erroneously kick in.
  3. Update A and set id = id + max_a_id, and likewise update B and set id = id + max_b_id.
  4. Export the data.

Next, import this data into the first database, and update sequences accordingly.
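The renumbering above can be sketched in plain Python, which also yields the old-to-new ID map the question asks for. This is only an illustration under stated assumptions: rows are held in memory as dictionaries, tables A and B are shaped as in the example, a_id is never null, and the function name renumber is made up. In a real merge the shift would be done with SQL UPDATE statements inside the second database.

```python
def renumber(a_rows, b_rows, max_a_id, max_b_id):
    """Shift IDs from the second database past the first database's maxima.

    Returns the shifted rows plus old->new ID maps for both tables.
    """
    a_map = {r["id"]: r["id"] + max_a_id for r in a_rows}
    b_map = {r["id"]: r["id"] + max_b_id for r in b_rows}

    new_a = [{**r, "id": a_map[r["id"]]} for r in a_rows]
    # Rewriting a_id through a_map plays the role of the cascading update.
    # A KeyError here means B references an A row that does not exist,
    # which can happen for columns lacking a real foreign key constraint.
    new_b = [{**r, "id": b_map[r["id"]], "a_id": a_map[r["a_id"]]}
             for r in b_rows]
    return new_a, new_b, a_map, b_map


# Hypothetical data from the second database; maxima come from the first.
second_a = [{"id": 1}, {"id": 2}]
second_b = [{"id": 1, "a_id": 2}]
new_a, new_b, a_map, b_map = renumber(second_a, second_b,
                                      max_a_id=10, max_b_id=5)
```

After the import, each sequence in the first database would need to be advanced past the largest shifted ID, for example to max(a_map.values()) + 1 for table A.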

You'll still need to be wary of overflows if IDs can end up larger than about 2.1 billion (the maximum of a 32-bit integer column is 2,147,483,647), and of unique keys that might exist in both databases. But at least you won't need to worry about duplicate IDs.
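The overflow concern is easy to check before shifting anything. The sketch below assumes the ID columns use PostgreSQL's 4-byte integer type, whose largest value is 2,147,483,647; the function name is hypothetical.

```python
INT4_MAX = 2**31 - 1  # largest value of a PostgreSQL 4-byte integer column


def offset_fits_int4(max_id_in_second, offset):
    """Return True if the largest shifted ID still fits in a 4-byte integer."""
    return max_id_in_second + offset <= INT4_MAX
```

If the check fails, the ID columns would have to be widened to bigint, or a different renumbering scheme chosen, before the merge.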

Denis de Bernardy answered Oct 15 '22 03:10



This is the sort of case where I'd look into ETL tools such as CloverETL, Pentaho Kettle or Talend Studio.

I tend to agree with Denis that there aren't any real shortcuts to avoid dealing with the complexity of a data merge.

Craig Ringer answered Oct 15 '22 02:10
