Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Database - Data Versioning (followup)

My original question can be found here, for which I've gotten some great answers, idas and tips.

As part of a feasibility and performance study, I've started to convert my schemas in order to version my data using those ideas. In doing so, I've come up with some kind of other problem.

In my original question, my example was simple, with no real relational references. In an attempt to preserve the example of my previous question, I will now extend the 'Name' part to another table.

So now, my data becomes:

Person
------------------------------------------------
ID                UINT NOT NULL,
NameID            UINT NOT NULL,
DOB               DATE NOT NULL,
Email             VARCHAR(100) NOT NULL

PersonAudit
------------------------------------------------
ID                UINT NOT NULL,
NameID            UINT NOT NULL,
DOB               DATE NOT NULL,
Email             VARCHAR(100) NOT NULL,
UserID            UINT NOT NULL,         -- Who
PersonID          UINT NOT NULL,         -- What
AffectedOn        DATE NOT NULL,         -- When
Comment           VARCHAR(500) NOT NULL  -- Why

Name
------------------------------------------------
ID                UINT NOT NULL,
FirstName         VARCHAR(200) NOT NULL,
LastName          VARCHAR(200) NOT NULL,
NickName          VARCHAR(200) NOT NULL

NameAudit
------------------------------------------------
ID                UINT NOT NULL,
FirstName         VARCHAR(200) NOT NULL,
LastName          VARCHAR(200) NOT NULL,
NickName          VARCHAR(200) NOT NULL,
UserID            UINT NOT NULL,         -- Who
NameID            UINT NOT NULL,         -- What
AffectedOn        DATE NOT NULL,         -- When
Comment           VARCHAR(500) NOT NULL  -- Why

In a GUI, we could see the following form:

ID            :  89213483
First Name    :  Firsty
Last Name     :  Lasty
Nick Name     :  Nicky
Date of Birth :  January 20th, 2005
Email Address :  [email protected]

A change can be made to:

  1. Only to the 'name' part
  2. Only to the 'person' part
  3. To both the 'name' and person parts

If '1' occurs, we copy the original record to NameAudit and update our Name record with the changes. Since the person reference to the name is still the same, no changes to Person or PersonAudit are required.

If '2' occurs, we copy the original record to PersonAudit and update the Person record with the changes. Since the name part has not changed, no changes to Name or NameAudit are required.

If '3' occurs, we update our database according to the two methods above.

If we were to make 100 changes to both the person and name parts, one problem occurs when you later try to show a history of changes. All my changes show the person having the last version of the name. Which is wrong obviously.

In order to fix this, it would seem that the NameID field in Person should reference the NameAudit instead (but only if Name has changes).

And it is this conditional logic that starts complicating things.

I would be curious to find out if anyone has had this kind of problem before with their database and what kind of solution was applied?

like image 245
Jeach Avatar asked May 25 '09 22:05

Jeach


1 Answers

You should probably try to read about 'Temporal Database' handling. Two books you could look at are Darwen, Date and Lorentzos "Temporal Data and the Relational Model" and (at a radically different extreme) "Developing Time-Oriented Database Applications in SQL", Richard T. Snodgrass, Morgan Kaufmann Publishers, Inc., San Francisco, July, 1999, 504+xxiii pages, ISBN 1-55860-436-7. That is out of print but available as PDF on his web site at cs.arizona.edu. You can also look for "Allen's Relations" for intervals - they may be helpful to you.


I assume that the DATE type in your database includes time (so you probably use Oracle). The SQL standard type would probably be TIMESTAMP with some number of fractional digits for sub-second resolution. If your DBMS does not include time with DATE, then you face a difficult problem deciding how to handle multiple changes in a single day.

What you need to show, presumably, is a history of the changes in either table, with the corresponding values from the other table that were in force at the time when the changes were made. You also need to decide whether what you are showing is the before or after image; presumably, again, the after image. That means that you will have a 'sequenced' query (Snodgrass's term), with columns like:

Start time        -- When this set of values became valid
End time          -- When this set of values became invalid
PersonID          -- Person.ID (or PersonAudit.ID) for Person data
NameID            -- Name.ID (or NameAudit.ID) for Name data
DOB               -- Date of Birth recorded while data was valid
Email             -- Email address recorded while data was valid
FirstName         -- FirstName recorded while data was valid
LastName          -- LastName recorded while data was valid
NickName          -- NickName recorded while data was valid

I assume that once a Person.ID is established, it does not change; ditto for Name.ID. That means that they remain valid while the records do.

One of the hard parts in this is establishing the correct set of 'start time' and 'end time' values, since transitions could occur in either table (or both). I'm not even sure, at the moment, that you have all the data you need. When a new record is inserted, you don't capture the time it becomes valid (there is nothing in the XYZAudit table when you insert a new record, is there?).


There's a lot more could be said. Before going further, though, I'd rather have some feedback about some of the issues raised so far.


Some other SO questions that might help:

  • Data structure for non-overlapping ranges within a single dimension
  • Why do we need a temporal database
  • Determine whether two date ranges overlap
  • Best relational database representation of time-bound hierarchies

Since this answer was first written, there's another book published about another set of methods called 'Asserted Versioning' for handling temporal data. The book is 'Managing Time in Relational Databases: How to Design, Update and Query Temporal Data' by Tom Johnston and Randall Weiss. You can find their company at AssertedVersioning.com. Beware: there may be patent issues around the mechanism.

Also, the SQL 2011 standard (ISO/IEC 9075:2011, in a number of parts) has been published. It includes some temporal data support. You can find out more about that and other issues related to temporal data at TemporalData.com, which is more of a general information site rather than one with a particular product axe to grind.

like image 166
Jonathan Leffler Avatar answered Sep 30 '22 13:09

Jonathan Leffler