Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Database Design for Revisions?

I think the key question to ask here is 'Who / What is going to be using the history'?

If it's going to be mostly for reporting / human readable history, we've implemented this scheme in the past...

Create a table called 'AuditTrail' or something that has the following fields...

[ID] [int] IDENTITY(1,1) NOT NULL,
[UserID] [int] NULL,
[EventDate] [datetime] NOT NULL,
[TableName] [varchar](50) NOT NULL,
[RecordID] [varchar](20) NOT NULL,
[FieldName] [varchar](50) NULL,
[OldValue] [varchar](5000) NULL,
[NewValue] [varchar](5000) NULL

You can then add a 'LastUpdatedByUserID' column to all of your tables which should be set every time you do an update / insert on the table.

You can then add a trigger to every table to catch any insert / update that happens and creates an entry in this table for each field that's changed. Because the table is also being supplied with the 'LastUpdateByUserID' for each update / insert, you can access this value in the trigger and use it when adding to the audit table.

We use the RecordID field to store the value of the key field of the table being updated. If it's a combined key, we just do a string concatenation with a '~' between the fields.

I'm sure this system may have drawbacks - for heavily updated databases the performance may be hit, but for my web-app, we get many more reads than writes and it seems to be performing pretty well. We even wrote a little VB.NET utility to automatically write the triggers based on the table definitions.

Just a thought!


  1. Do not put it all in one table with an IsCurrent discriminator attribute. This just causes problems down the line, requires surrogate keys and all sorts of other problems.
  2. Design 2 does have problems with schema changes. If you change the Employees table you have to change the EmployeeHistories table and all the related sprocs that go with it. Potentially doubles you schema change effort.
  3. Design 1 works well and if done properly does not cost much in terms of a performance hit. You could use an xml schema and even indexes to get over possible performance problems. Your comment about parsing the xml is valid but you could easily create a view using xquery - which you can include in queries and join to. Something like this...
CREATE VIEW EmployeeHistory
AS
, FirstName, , DepartmentId

SELECT EmployeeId, RevisionXML.value('(/employee/FirstName)[1]', 'varchar(50)') AS FirstName,

  RevisionXML.value('(/employee/LastName)[1]', 'varchar(100)') AS LastName,

  RevisionXML.value('(/employee/DepartmentId)[1]', 'integer') AS DepartmentId,

FROM EmployeeHistories 

The History Tables article in the Database Programmer blog might be useful - covers some of the points raised here and discusses the storage of deltas.

Edit

In the History Tables essay, the author (Kenneth Downs), recommends maintaining a history table of at least seven columns:

  1. Timestamp of the change,
  2. User that made the change,
  3. A token to identify the record that was changed (where the history is maintained separately from the current state),
  4. Whether the change was an insert, update, or delete,
  5. The old value,
  6. The new value,
  7. The delta (for changes to numerical values).

Columns which never change, or whose history is not required, should not be tracked in the history table to avoid bloat. Storing the delta for numerical values can make subsequent queries easier, even though it can be derived from the old and new values.

The history table must be secure, with non-system users prevented from inserting, updating or deleting rows. Only periodic purging should be supported to reduce overall size (and if permitted by the use case).


We have implemented a solution very similar to the solution that Chris Roberts suggests, and that works pretty well for us.

Only difference is that we only store the new value. The old value is after all stored in the previous history row

[ID] [int] IDENTITY(1,1) NOT NULL,
[UserID] [int] NULL,
[EventDate] [datetime] NOT NULL,
[TableName] [varchar](50) NOT NULL,
[RecordID] [varchar](20) NOT NULL,
[FieldName] [varchar](50) NULL,
[NewValue] [varchar](5000) NULL

Lets say you have a table with 20 columns. This way you only have to store the exact column that has changed instead of having to store the entire row.


Avoid Design 1; it is not very handy once you will need to for example rollback to old versions of the records - either automatically or "manually" using administrators console.

I don't really see disadvantages of Design 2. I think the second, History table should contain all columns present in the first, Records table. E.g. in mysql you can easily create table with the same structure as another table (create table X like Y). And, when you are about to change structure of the Records table in your live database, you have to use alter table commands anyway - and there is no big effort in running these commands also for your History table.

Notes

  • Records table contains only lastest revision;
  • History table contains all previous revisions of records in Records table;
  • History table's primary key is a primary key of the Records table with added RevisionId column;
  • Think about additional auxiliary fields like ModifiedBy - the user who created particular revision. You may also want to have a field DeletedBy to track who deleted particular revision.
  • Think about what DateModified should mean - either it means where this particular revision was created, or it will mean when this particular revision was replaced by another one. The former requires the field to be in the Records table, and seems to be more intuitive at the first sight; the second solution however seems to be more practical for deleted records (date when this particular revision was deleted). If you go for the first solution, you would probably need a second field DateDeleted (only if you need it of course). Depends on you and what you actually want to record.

Operations in Design 2 are very trivial:

Modify
  • copy the record from Records table to History table, give it new RevisionId (if it is not already present in Records table), handle DateModified (depends on how you interpret it, see notes above)
  • go on with normal update of the record in Records table
Delete
  • do exactly the same as in the first step of Modify operation. Handle DateModified/DateDeleted accordingly, depending on the interpretation you have chosen.
Undelete (or rollback)
  • take highest (or some particular?) revision from History table and copy it to the Records table
List revision history for particular record
  • select from History table and Records table
  • think what exactly you expect from this operation; it will probably determine what information you require from DateModified/DateDeleted fields (see notes above)

If you go for Design 2, all SQL commands needed to do that will be very very easy, as well as maintenance! Maybe, it will be much much easier if you use the auxiliary columns (RevisionId, DateModified) also in the Records table - to keep both tables at exactly the same structure (except for unique keys)! This will allow for simple SQL commands, which will be tolerant to any data structure change:

insert into EmployeeHistory select * from Employe where ID = XX

Don't forget to use transactions!

As for the scaling, this solution is very efficient, since you don't transform any data from XML back and forth, just copying whole table rows - very simple queries, using indices - very efficient!