I've read a few questions on SO (such as this one) in regards to versioning your data within a database.
I liked some of the suggestions that were mentioned. For the longest time I have wanted (needed) to revision many of my tables but never got around to it. Being a programmer with only simple database work under my belt, I was wondering how one would actually go about doing this.
I'm not asking for the actual solution in SQL syntax. I can eventually figure that out for myself (or post on SO when the time comes). I'm just asking for people to comment on how they would go about doing it, and on any potential performance problems there might be if I were to 'revision' hundreds of millions of records. Any other suggestions are welcome as long as they are based on the example below.
Given a simple example:
Person
------------------------------------------------
ID UINT NOT NULL,
PersonID UINT NOT NULL,
Name VARCHAR(200) NOT NULL,
DOB DATE NOT NULL,
Email VARCHAR(100) NOT NULL
Audit
------------------------------------------------
ID UINT NOT NULL,
UserID UINT NOT NULL, -- Who
TableName VARCHAR(50) NOT NULL, -- What
OldRecID UINT NOT NULL, -- Where
NewRecID UINT NOT NULL,
AffectedOn DATE NOT NULL, -- When
Comment VARCHAR(500) NOT NULL -- Why
I'm not sure how one would link the Audit table to any other tables (such as Person) if the TableName is a string?
Also, assuming that I have three GUIs to populate:
To accomplish 1 and 2, would it be better to query the Person table or the Audit table?
To accomplish 3, would a so-called database expert simply get all records and pass them on to the software for processing, or group by PersonID and AffectedOn date? Is this usually handled in one query or many?
Versioning a database means sharing all changes to a database that are necessary for other team members to get the project running properly. Database versioning starts with a settled database schema (skeleton) and optionally with some data.
Typically, for versioning or storing historical data you do one of two (or both) things. You have a separate table that mimics the original table + a date/time column for the date it was changed. Whenever a record is updated, you insert the existing contents into the history table just prior to the update.
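As a rough illustration of the history-table idea described above, here is a minimal sketch using Python's built-in sqlite3 module. Using a trigger to do the copy is my own assumption (the copy could equally be done by application code), and the ChangedOn column name, the simplified Person columns, and the sample data are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL,
    DOB      TEXT NOT NULL,
    Email    TEXT NOT NULL
);

-- Same columns as Person, plus the time the old values were superseded.
-- ChangedOn is a hypothetical column name chosen for this sketch.
CREATE TABLE Person_History (
    PersonID  INTEGER NOT NULL,
    Name      TEXT NOT NULL,
    DOB       TEXT NOT NULL,
    Email     TEXT NOT NULL,
    ChangedOn TEXT NOT NULL
);

-- Copy the existing row into the history table just before each update.
CREATE TRIGGER Person_BeforeUpdate BEFORE UPDATE ON Person
BEGIN
    INSERT INTO Person_History (PersonID, Name, DOB, Email, ChangedOn)
    VALUES (OLD.PersonID, OLD.Name, OLD.DOB, OLD.Email, datetime('now'));
END;
""")

conn.execute(
    "INSERT INTO Person VALUES (1, 'Ann Lee', '1980-02-03', 'ann@old.example')"
)
conn.execute("UPDATE Person SET Email = 'ann@new.example' WHERE PersonID = 1")

# The base table holds the new value; the history table holds the old one.
current = conn.execute(
    "SELECT Email FROM Person WHERE PersonID = 1"
).fetchone()
history = conn.execute(
    "SELECT Email FROM Person_History WHERE PersonID = 1"
).fetchall()
print(current, history)
```

With a trigger, every UPDATE is captured regardless of which application issued it, at the cost of hiding the extra write from the caller.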
Data versioning is the storage of different versions of data that were created or changed at specific points in time. There are many different reasons for making changes to the data. Data scientists, for example, might change a dataset while testing ML models to increase efficiency.
I have done various audit schemes over the years and I am currently going to implement something like this:
Person
------------------------------------------------
ID UINT NOT NULL,
PersonID UINT NOT NULL,
Name VARCHAR(200) NOT NULL,
DOB DATE NOT NULL,
Email VARCHAR(100) NOT NULL
Person_History
------------------------------------------------
ID UINT NOT NULL,
PersonID UINT NOT NULL,
Name VARCHAR(200) NOT NULL,
DOB DATE NOT NULL,
Email VARCHAR(100) NOT NULL,
AuditID UINT NOT NULL
Audit
------------------------------------------------
ID UINT NOT NULL,
UserID UINT NOT NULL, -- Who
AffectedOn DATE NOT NULL, -- When
Comment VARCHAR(500) NOT NULL -- Why
The current records are always in the Person table. If there is a change, an audit record is created and the old record is copied into the Person_History table (note that the ID does not change, and there can be multiple versions).
The Audit ID is in the *_History tables so you can link multiple record changes to one audit record if you like.
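To make the scheme above concrete, here is a small sketch in Python with sqlite3. It is only one possible reading of the layout: I have dropped the separate PersonID column for brevity, the change_emails helper and the sample data are hypothetical, and SQLite types stand in for UINT/VARCHAR. It shows two record changes sharing one audit record, and a join that answers "who changed what, and why".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (
    ID INTEGER PRIMARY KEY, Name TEXT NOT NULL,
    DOB TEXT NOT NULL, Email TEXT NOT NULL
);
CREATE TABLE Audit (
    ID INTEGER PRIMARY KEY, UserID INTEGER NOT NULL,   -- Who
    AffectedOn TEXT NOT NULL,                          -- When
    Comment TEXT NOT NULL                              -- Why
);
CREATE TABLE Person_History (
    ID INTEGER NOT NULL, Name TEXT NOT NULL,
    DOB TEXT NOT NULL, Email TEXT NOT NULL,
    AuditID INTEGER NOT NULL REFERENCES Audit(ID)
);
INSERT INTO Person VALUES (1, 'Ann Lee', '1980-02-03', 'ann@old.example');
INSERT INTO Person VALUES (2, 'Bob Roe', '1975-11-30', 'bob@old.example');
""")


def change_emails(conn, user_id, comment, new_emails):
    """Apply several email changes under a single audit record (hypothetical helper)."""
    with conn:  # commits on success, rolls back if any statement fails
        audit_id = conn.execute(
            "INSERT INTO Audit (UserID, AffectedOn, Comment) "
            "VALUES (?, date('now'), ?)",
            (user_id, comment),
        ).lastrowid
        for person_id, email in new_emails.items():
            # Copy the old row into history, tagged with the audit record.
            conn.execute(
                "INSERT INTO Person_History (ID, Name, DOB, Email, AuditID) "
                "SELECT ID, Name, DOB, Email, ? FROM Person WHERE ID = ?",
                (audit_id, person_id),
            )
            conn.execute(
                "UPDATE Person SET Email = ? WHERE ID = ?", (email, person_id)
            )
    return audit_id


audit_id = change_emails(
    conn, 42, "Domain migration",
    {1: "ann@new.example", 2: "bob@new.example"},
)

# Old values together with who changed them and why.
rows = conn.execute("""
    SELECT h.ID, h.Email, a.UserID, a.Comment
    FROM Person_History AS h
    JOIN Audit AS a ON a.ID = h.AuditID
    ORDER BY h.ID
""").fetchall()
print(rows)
```

Because the link goes from each *_History table to Audit, the Audit table needs no TableName column, and current-record queries only ever touch Person.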
EDIT:
If you don't have a separate history table for each base table and want to use the same table to hold old and "deleted" records, then you have to mark the records with a status flag. The problem with that is that it's a real pain when querying for current records. Trust me, I've done that.
How about you create the table as normal, have a ModifiedDate column on each record (and ModifiedBy if you like), and do all your data access through a materialized view which groups the data by Id and keeps only the row whose ModifiedDate equals MAX(ModifiedDate) for that Id (conceptually, HAVING ModifiedDate = MAX(ModifiedDate))?
This way, adding a new record with the same Id as another will remove the old record from the view. If you want to query history, don't go through the view.
I've always found maintaining different tables with the same columns to be complex and error prone.
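Here is a minimal sketch of that single-table idea, again in Python with sqlite3. Some assumptions: SQLite has no materialized views, so an ordinary view stands in for one; the "latest row per Id" filter is written as a join against a grouped subquery, which is portable across dialects; and the CurrentPerson view name and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (
    Id           INTEGER NOT NULL,
    Name         TEXT NOT NULL,
    Email        TEXT NOT NULL,
    ModifiedDate TEXT NOT NULL,   -- ISO-8601 text, so string order = time order
    PRIMARY KEY (Id, ModifiedDate)
);

-- Only the most recent version of each Id is visible through the view.
CREATE VIEW CurrentPerson AS
SELECT p.Id, p.Name, p.Email, p.ModifiedDate
FROM Person AS p
JOIN (SELECT Id, MAX(ModifiedDate) AS LatestDate
      FROM Person
      GROUP BY Id) AS latest
  ON latest.Id = p.Id AND latest.LatestDate = p.ModifiedDate;

INSERT INTO Person VALUES (1, 'Ann Lee', 'ann@old.example', '2020-01-01 09:00:00');
INSERT INTO Person VALUES (2, 'Bob Roe', 'bob@old.example', '2020-01-01 09:00:00');
-- A "change" is just a new row with the same Id and a later ModifiedDate.
INSERT INTO Person VALUES (1, 'Ann Lee', 'ann@new.example', '2021-06-15 14:30:00');
""")

# Current data goes through the view; history bypasses it.
current = conn.execute(
    "SELECT Id, Email FROM CurrentPerson ORDER BY Id"
).fetchall()
history = conn.execute(
    "SELECT Email FROM Person WHERE Id = 1 ORDER BY ModifiedDate"
).fetchall()
print(current)
print(history)
```

On a table with hundreds of millions of rows, a non-materialized view like this recomputes the grouping on every query, so an index on (Id, ModifiedDate) or a genuinely materialized view would matter a great deal.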
edit: I've just returned to this answer 12 years after I wrote it. I would say that the original question is misguided: you should be auditing user-level events, not changes to database columns.
Following DJ's post in using a history table per base table and a comment by Karl about possible performance issues, I've done a bit of SQL research in order to figure out the fastest possible way to transfer a record from one table to another.
I just wanted to document what I found:
I thought that I would have to do an SQL fetch to load the record from the base table, followed by an SQL push to put the record into the history table, followed by an update to the base table to insert the changed data. A total of three statements.
But to my surprise I realized that you can do the first two steps in one SQL statement using the INSERT INTO ... SELECT syntax (SELECT INTO, in the dialects where it targets a table, creates a new table rather than appending to an existing one). I'm betting this would be far faster (perhaps a hundredfold), since the row never has to make a round trip to the client.
That would leave us to simply UPDATE the record with the new data in the base table.
I still haven't found a single SQL statement that does all three steps at once (I doubt I will).
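For what it's worth, the copy-then-update sequence described above can at least be made atomic by wrapping both statements in one transaction. The sketch below uses Python's sqlite3 module; the update_email helper, the trimmed-down column list, and the sample data are assumptions made for the example, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (
    ID INTEGER PRIMARY KEY, Name TEXT NOT NULL, Email TEXT NOT NULL
);
CREATE TABLE Person_History (
    ID INTEGER NOT NULL, Name TEXT NOT NULL, Email TEXT NOT NULL
);
INSERT INTO Person VALUES (1, 'Ann Lee', 'ann@old.example');
""")


def update_email(conn, person_id, new_email):
    """Archive the current row, then update it, as one transaction (hypothetical helper)."""
    with conn:  # commit if both statements succeed, roll back otherwise
        # Steps 1 and 2 combined: copy the row without fetching it to the client.
        conn.execute(
            "INSERT INTO Person_History (ID, Name, Email) "
            "SELECT ID, Name, Email FROM Person WHERE ID = ?",
            (person_id,),
        )
        # Step 3: write the new data into the base table.
        conn.execute(
            "UPDATE Person SET Email = ? WHERE ID = ?", (new_email, person_id)
        )


update_email(conn, 1, "ann@new.example")

# A failing update must not leave a stray history row behind.
try:
    update_email(conn, 1, None)  # violates NOT NULL on Person.Email
except sqlite3.IntegrityError:
    pass

history_count = conn.execute("SELECT COUNT(*) FROM Person_History").fetchone()[0]
current_email = conn.execute("SELECT Email FROM Person WHERE ID = 1").fetchone()[0]
print(history_count, current_email)
```

It is still two statements, but from the point of view of any other reader of the database they take effect together or not at all.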