Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Database schema design pattern for data that merges

We have a rather antiquated database that contains a large number of individuals along with a number of achievements that they have completed. There has historically been little done to discourage duplicated individual data so we've ended up in a situation where our data is quite dirty. A massively simplified version of this can be found here.

We're now redesigning the schema and user interface. We will be providing the user with a tool to merge their individuals together. In the provided example, Dave and David are clearly the same person and has achieved 4 achievements in total.

Given that users make mistakes and that there are a lot more tables involved than in the example, I'm looking for a schema design that facilitates easy merging of data, and notably, un-merging of data if (when!) the user inevitably makes a mistake.

Linked lists of some sort appear to be a solution but aren't exactly efficient for this use case. Are there any other concepts that might lend themselves to this situation? Any specific design patterns that might be appropriate?

Edit: As SQLFiddle is being rather flaky today here's the create/insert/select sql that was on sqlfiddle:

CREATE TABLE individual
    (`individual_id` int, `forename` varchar(50), `surname` varchar(50))
;

CREATE TABLE achievement
    (`achievement_id` int, `name` varchar(50), `description` varchar(50))
;

CREATE TABLE individual_achievement
    (`individual_id` int,`achievement_id` int)
;

INSERT INTO individual
    (`individual_id`, `forename`, `surname`)
VALUES
    (1, 'Dave', 'Deane'),
    (2, 'David', 'Deane')
;

INSERT INTO achievement
    (`achievement_id`, `name`, `description`)
VALUES
    (1, 'unit_1', 'Unit 1'),
    (2, 'unit_2', 'Unit 2'),
    (3, 'unit_3', 'Unit 3'),
    (4, 'unit_4', 'Unit 4')
;

INSERT INTO individual_achievement
    (`individual_id`,`achievement_id`)
VALUES
    (1, 1),
    (1, 3),
    (2, 2),
    (2, 4)
;

select * from individual i
join individual_achievement ai using (individual_id)
join achievement a using (achievement_id)

Edit 2: Just found this very similar question, hoping in 4 years there might other solutions too.

like image 362
Rob Forrest Avatar asked Aug 07 '15 11:08

Rob Forrest


People also ask

What is design pattern in database?

A design pattern, or design solution, or simply design, is a response to a problem. The structure of a pattern is based (a) on the traditional pattern structure as delivered by Gamma et al [GHJV95] and (b) on the fundamentals of everyday operations around a database system.

What is design schemas of data modeling?

A schema is a collection of database objects, including tables, views, indexes, and synonyms. There is a variety of ways of arranging schema objects in the schema models designed for data warehousing. One data warehouse schema model is a star schema.

What are patterns in Mongodb?

The Attribute Pattern is useful for problems that are based around having big documents with many similar fields but there is a subset of fields that share common characteristics and we want to sort or query on that subset of fields. When the fields we need to sort on are only found in a small subset of documents.

What are design patterns in SQL?

A design pattern is simply a recognised way of solving a problem that works in different specific circumstances. Do such things exist in SQL? Of course - for example implementing many to many relationships between two tables using a third table.


1 Answers

Here’s one tactic you could use.

First, create a new table, for now call it “Individual_v2”, with the exact same columns as original table Individual. (Ideally, you will eventually replace Individual with this table; realistically, people might still enter data into Individual, and you’ll have to “clean” the data by moving or merging it into Individual_v2.) Configure this table with links to Achievement. (For now, I’m assuming Achievement is clean.)

Then, create a “Mapping” table like so:

IndividualMapping

OldIndividual_Id
NewIndividual_Id
CreatedAt
CreatedBy
ApprovedAt  --  Nullable!
ApprovedBy  --  Nullable!

The “Created” columns are used to determine when and by who (or what) the mapping was created.

The “Approved” columns are used to determine if the data has been migrated to the new tables.

For every “old” item, you determine where it might map in the “new” table; if it maps to no existing item, create one for it in the new table.

Then, add an entry in the mapping table. If a new item was created, mark it as approved; if confidence is high, mark it as approved; otherwise, leave it “unapproved” and waiting for review. In due time a reviewer will look things over and approve the mapping, change the mapping to a different existing new item, or create another new item and map to it.

Once completed, the “real” work is done against the new table. The old table and the mapping table can be used to identify where the new data came from and, if necessary, to undo/change mappings.

There are a lot of unanswered implementation and support issues here, and overall it seems awkward as heck. Long-term, once you’ve resolved the problem of duplicate data you can drop the old (and the mapping) table, but until then you’ll have a fussy system.


addenda

I’m kind of talking through things here, without going into exhaustive analysis. The system I think you’re describing is going to be fussy and conceptually complex to work with, even if the tables are relatively simple, and final details are outside the scope of an SO questions. Too, a lot depends on what the overall goals and objectives of the system and its redesign are. I’m going to make some assumptions here:

  • The “existing” system will stay in place

  • Individuals (and their awards) so entered must be made immediately available, as they always have been.

  • Duplicates will continue to be entered; if, when, and as feasible, they are to be “consolidated” with pre-existing entries

Done this way, the system would work something as follows:

  • There’s a separate relation table between the Individuals_v2 and Achievement (Individual_Achievement_v2 for now, though there must be a better name).

  • The data in the “v2” tables is correct, good, and right. The “v1” tables are staging, history, log data.

  • Prepare an initial release, where all entries in the v1 tables are configured in the v2 tables. If rows can be consolidated during this step, so much the better. Everything gets logged in the “map” table, so that they can be cleanly back and redone if necessary.

  • Going forward from this release, new data is entered in the v1 tables, and simultaneously/immediately entered in the v2 tables as well. If a mapping to an existing item can be made, do so, otherwise create a new entry in the v2 tables. Always log activity in the “map” table.

  • Going forward, all “Live” queries hit the v2 tables. The v1 tables are (again) history, log, audit trail. Once populated, they are never modified, while the v2 tables (including the mapping table) can and will be.

  • As determined by the business, periodic reviews/checks are made on the data, to look for and fix duplicate entries that have appeared over time, as well as “invalid duplicates” (incorrect mappings). This is when you do the rollback/redo work, as tracked in the mapping and v1 tables.

You’ll maybe want some extra logging tables, to track things like “all data entered up through xx/xx/xxxx is valid, data entered since then must be reviewed”. I’m sure there are other issues and subtleties that will crop up—they always do…

like image 66
Philip Kelley Avatar answered Oct 21 '22 10:10

Philip Kelley