Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generic version control strategy for select table data within a heavily normalized database

Sorry for the long winded title, but the requirement/problem is rather specific.

With reference to the following sample (but very simplified) structure (in psuedo SQL), I hope to explain it a bit better.

TABLE StructureName {
  Id GUID PK,
  Name varchar(50) NOT NULL
}

TABLE Structure {
  Id GUID PK,
  ParentId GUID,                 -- FK to Structure
  NameId GUID NOT NULL           -- FK to StructureName
}

TABLE Something {
  Id GUID PK,
  RootStructureId GUID NOT NULL  -- FK to Structure
}

As one can see, Structure is a simple tree structure (not worried about ordering of children for the problem). StructureName is a simplification of a translation system. Finally 'Something' is simply something referencing the tree's root structure.

This is just one of many tables that need to be versioned, but this one serves as a good example for most cases.

There is a requirement to version to any changes to the name and/or the tree 'layout' of the Structure table. Previous versions should always be available.

There seems to be a few possibilities to tackle this issue, like copying the entire structure, but most approaches causes one to 'loose' referential integrity. Example if one followed this approach, one would have to make a duplicate of the 'Something' record, given that the root structure will be a new record, and have a new ID.

Other avenues of possible solutions are looking into how Wiki's handle this or go a lot further and look how proper version control systems work.

Currently, I feel a bit clueless how to proceed on this in a generic way.

Any ideas will be greatly appreciated.

Thanks

leppie

like image 913
leppie Avatar asked Mar 03 '09 11:03

leppie


People also ask

What are the four 4 types of database normalization?

First Normal Form (1 NF) Second Normal Form (2 NF) Third Normal Form (3 NF) Boyce Codd Normal Form or Fourth Normal Form ( BCNF or 4 NF)

What SQL statement should the user use when selecting data from the database?

The SELECT statement is used to select data from a database. The data returned is stored in a result table, called the result-set.


2 Answers

Some quick ideas:

Full copy: Create a copy of the structure, but for every table add a version_id column to the PK and all FKs; thus you can create copies of the life data with complete referential integrity.

  • pro: easy to query the history
  • con: large amount of (redundant data copied)

Change copy: Only copy the stuff that actually changes, along with valid_from / valid_to data.

  • pro: low data volum copied
  • con: hard to query, because one has to join on intervals

Variation: This applies to both schemes. Instead of creating a copy of the structure, you might keept the current record in the same table as the old versions, but tag it as current.

  • pro: smaller number of tables, easier mixing of history and current information
  • con: normal operation operates on much bigger tables, which will cause a performance impact

Auditing log: Depending on your actual requirements it be sufficient to just create an audit trail like this:

id,  timestamp,  changed_table,  changed_column,  old_value,  new_value,  changed_by

You might extend that to a full table structure:

transaction,  table_change,  changed_column
  • pro: generic, hence easy to implement for a large number of tables
  • con: if you need to reconstruct the state of a set of records at a given time, querying will become a nightmare

I wrote a blog about various approaches to versioning, but be warned: it's in German.

like image 157
Jens Schauder Avatar answered Nov 15 '22 08:11

Jens Schauder


The data warehousing folks have several algorithms for "slowly-changing dimensions".

The more sophisticated algorithms provide data ranges around a dimension value to indicate when it's valid.

Depending on your versioning requirements you could do one of these things, cribbed from Kimball's The Data Warehousing Toolkit.

  1. Assign a version number to rows of the structure table. This means you have to do some reasoning to collect a a complete structure. It includes the selected version number unioned with rows that are unchanged in an earlier version.

  2. Assign a date range or version range to rows of the structure table. This means that some rows have start dates and end dates; some rows will have end dates at some epoch in the impossible future. Or, if you use version numbers, you'll have a start-end pair or a start-infinity pair that indicates this row is still current. You can then trivially query the rows that are valid "today" or apply to the requested version.

  3. Clone the structure for each version. This unpleasant because the clone operation is costly. The queries however, are trivial because the entire structure is available with a single, consistent version number.

like image 29
S.Lott Avatar answered Nov 15 '22 08:11

S.Lott