Database schema design pattern for data that merges

Tags:

We have a rather antiquated database that contains a large number of individuals along with a number of achievements that they have completed. There has historically been little done to discourage duplicated individual data so we've ended up in a situation where our data is quite dirty. A massively simplified version of this can be found here.

We're now redesigning the schema and user interface. We will be providing the user with a tool to merge their individuals together. In the provided example, Dave and David are clearly the same person and has achieved 4 achievements in total.

Given that users make mistakes and that there are a lot more tables involved than in the example, I'm looking for a schema design that facilitates easy merging of data, and notably, un-merging of data if (when!) the user inevitably makes a mistake.

Linked lists of some sort appear to be a solution but aren't exactly efficient for this use case. Are there any other concepts that might lend themselves to this situation? Any specific design patterns that might be appropriate?

Edit: As SQLFiddle is being rather flaky today here's the create/insert/select sql that was on sqlfiddle:

CREATE TABLE individual
    (`individual_id` int, `forename` varchar(50), `surname` varchar(50))
;

CREATE TABLE achievement
    (`achievement_id` int, `name` varchar(50), `description` varchar(50))
;

CREATE TABLE individual_achievement
    (`individual_id` int,`achievement_id` int)
;

INSERT INTO individual
    (`individual_id`, `forename`, `surname`)
VALUES
    (1, 'Dave', 'Deane'),
    (2, 'David', 'Deane')
;

INSERT INTO achievement
    (`achievement_id`, `name`, `description`)
VALUES
    (1, 'unit_1', 'Unit 1'),
    (2, 'unit_2', 'Unit 2'),
    (3, 'unit_3', 'Unit 3'),
    (4, 'unit_4', 'Unit 4')
;

INSERT INTO individual_achievement
    (`individual_id`,`achievement_id`)
VALUES
    (1, 1),
    (1, 3),
    (2, 2),
    (2, 4)
;

select * from individual i
join individual_achievement ai using (individual_id)
join achievement a using (achievement_id)

Edit 2: Just found this very similar question, hoping in 4 years there might other solutions too.

362

asked Aug 07 '15 11:08

Rob Forrest

1 Answers

Here’s one tactic you could use.

First, create a new table, for now call it “Individual_v2”, with the exact same columns as original table Individual. (Ideally, you will eventually replace Individual with this table; realistically, people might still enter data into Individual, and you’ll have to “clean” the data by moving or merging it into Individual_v2.) Configure this table with links to Achievement. (For now, I’m assuming Achievement is clean.)

Then, create a “Mapping” table like so:

IndividualMapping

OldIndividual_Id
NewIndividual_Id
CreatedAt
CreatedBy
ApprovedAt  --  Nullable!
ApprovedBy  --  Nullable!

The “Created” columns are used to determine when and by who (or what) the mapping was created.

The “Approved” columns are used to determine if the data has been migrated to the new tables.

For every “old” item, you determine where it might map in the “new” table; if it maps to no existing item, create one for it in the new table.

Then, add an entry in the mapping table. If a new item was created, mark it as approved; if confidence is high, mark it as approved; otherwise, leave it “unapproved” and waiting for review. In due time a reviewer will look things over and approve the mapping, change the mapping to a different existing new item, or create another new item and map to it.

Once completed, the “real” work is done against the new table. The old table and the mapping table can be used to identify where the new data came from and, if necessary, to undo/change mappings.

There are a lot of unanswered implementation and support issues here, and overall it seems awkward as heck. Long-term, once you’ve resolved the problem of duplicate data you can drop the old (and the mapping) table, but until then you’ll have a fussy system.

addenda

I’m kind of talking through things here, without going into exhaustive analysis. The system I think you’re describing is going to be fussy and conceptually complex to work with, even if the tables are relatively simple, and final details are outside the scope of an SO questions. Too, a lot depends on what the overall goals and objectives of the system and its redesign are. I’m going to make some assumptions here:

The “existing” system will stay in place
Individuals (and their awards) so entered must be made immediately available, as they always have been.
Duplicates will continue to be entered; if, when, and as feasible, they are to be “consolidated” with pre-existing entries

Done this way, the system would work something as follows:

There’s a separate relation table between the Individuals_v2 and Achievement (Individual_Achievement_v2 for now, though there must be a better name).
The data in the “v2” tables is correct, good, and right. The “v1” tables are staging, history, log data.
Prepare an initial release, where all entries in the v1 tables are configured in the v2 tables. If rows can be consolidated during this step, so much the better. Everything gets logged in the “map” table, so that they can be cleanly back and redone if necessary.
Going forward from this release, new data is entered in the v1 tables, and simultaneously/immediately entered in the v2 tables as well. If a mapping to an existing item can be made, do so, otherwise create a new entry in the v2 tables. Always log activity in the “map” table.
Going forward, all “Live” queries hit the v2 tables. The v1 tables are (again) history, log, audit trail. Once populated, they are never modified, while the v2 tables (including the mapping table) can and will be.
As determined by the business, periodic reviews/checks are made on the data, to look for and fix duplicate entries that have appeared over time, as well as “invalid duplicates” (incorrect mappings). This is when you do the rollback/redo work, as tracked in the mapping and v1 tables.

You’ll maybe want some extra logging tables, to track things like “all data entered up through xx/xx/xxxx is valid, data entered since then must be reviewed”. I’m sure there are other issues and subtleties that will crop up—they always do…

answered Oct 21 '22 10:10

Philip Kelley

Related questions
                            
                                Best place to bring up new window in Model View ViewModel
                            
                                Better design pattern for MongoDB document-object-mapper
                            
                                Is there any design pattern to switch between data depending on the device type?
                            
                                Which design pattern is better to work with Controllers/Services/DAOs
                            
                                How to design laravel 5.1 application for web client and native mobile apps
                            
                                Protocol Oriented Programming and the Delegate Pattern
                            
                                Why is factory method a class pattern, while an abstract factory an object pattern?
                            
                                ASP.NET MVP - Utilizing User Controls
                            
                                Open source OCR for Chinese [closed]
                            
                                Why both SqlConnection and SqlTransaction are present in SqlCommand constructor?
                            
                                How to execute a method in another thread?
                            
                                What is a clean way (or pattern) to do function parameter checking?
                            
                                A good design pattern for implementing different behaviors on a subject
                            
                                What's the best pattern to design an asynchronous RPC application using Python, Pika and AMQP?
                            
                                Is it acceptable to use extension methods on a class which you can modify
                            
                                Why is diamond of death and not triangle of death [duplicate]
                            
                                How to implement Observer pattern in C++ [closed]
                            
                                Visitor pattern with Java 8 default methods
                            
                                Golang auto-including extensible application
                            
                                Comparing Java swing MVC with Android design pattern

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Database schema design pattern for data that merges

Tags:

merge

design-patterns

database-schema

database-design

Rob Forrest

People also ask

1 Answers

Philip Kelley

Recent Activity

Donate For Us