
How to efficiently version records in an SQL database

In at least one application, I need to keep old versions of records in a relational database. When a record is updated, a new copy is added instead and the old row is marked as not current. When a record is deleted, it is instead marked as not current or deleted.

There is a simple use case of this: New versions of a record can only be added at the current time, superseding one row each. This can be used for archiving previous records when saving new data. For this, I'd add the following columns to each table:

VersionTime datetime -- Time when this version becomes effective
IsCurrent bool -- Indicates whether this version is the most current (and not deleted)

This is good if you only need to know what the most current version of a record is, and only enumerate previous versions of a single record separately. Point-in-time queries are even more painful than with the second variant.
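A minimal sketch of this first variant, using Python's built-in sqlite3 module only so that it runs as written. The table name `setting` and the columns `RowId`, `EntityId` and `Value` are assumptions for illustration; only `VersionTime` and `IsCurrent` come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE setting (
        RowId       INTEGER PRIMARY KEY,  -- surrogate key per version (assumption)
        EntityId    INTEGER NOT NULL,     -- identifies the logical record (assumption)
        Value       TEXT,
        VersionTime TEXT NOT NULL,        -- time when this version becomes effective
        IsCurrent   INTEGER NOT NULL      -- 1 = most current and not deleted
    )
""")

def save_version(conn, entity_id, value, now):
    """Supersede the current row of entity_id with a new version."""
    with conn:  # one transaction: both statements commit or roll back together
        conn.execute(
            "UPDATE setting SET IsCurrent = 0 WHERE EntityId = ? AND IsCurrent = 1",
            (entity_id,))
        conn.execute(
            "INSERT INTO setting (EntityId, Value, VersionTime, IsCurrent) "
            "VALUES (?, ?, ?, 1)",
            (entity_id, value, now))

def value_as_of(conn, entity_id, time):
    """Point-in-time lookup: the latest version that was effective at `time`."""
    row = conn.execute(
        "SELECT Value FROM setting WHERE EntityId = ? AND VersionTime <= ? "
        "ORDER BY VersionTime DESC LIMIT 1",
        (entity_id, time)).fetchone()
    return row[0] if row else None

save_version(conn, 1, "red", "2014-01-01 00:00:00")
save_version(conn, 1, "blue", "2014-02-01 00:00:00")

current = conn.execute(
    "SELECT Value FROM setting WHERE EntityId = 1 AND IsCurrent = 1").fetchall()
history = conn.execute(
    "SELECT Value FROM setting WHERE EntityId = 1 ORDER BY VersionTime").fetchall()
```

Note that in this layout a deletion only clears `IsCurrent` and records no time, so `value_as_of` cannot tell when a record stopped existing. That is one reason point-in-time queries are awkward here.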

A more generic variant is this: Versions of records can be added at any time for any specified validity time range. So I could declare that some setting of an entity is valid until end of 2013, and another version of it is valid in 2014, and yet another version will be valid from 2015 on. This can be used both to archive old data (as above) and to plan ahead to use different data at some time in the future (and to keep this information as an archive). For this, I'd add the following columns to each table:

ValidFrom datetime -- Time when this version becomes valid (inclusive)
ValidTo datetime -- Time when this version becomes invalid (exclusive)

The second approach can basically represent the first as well, but it's harder to know what version is the most recent - because you can also add versions for the future. Also, the ValidFrom/ValidTo design is able to declare overlapping ranges, and by definition, the row with the highest ValidFrom shall apply in that case.
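To make the "highest ValidFrom applies" rule concrete, here is a hedged sketch of a per-entity point-in-time query, again run through Python's sqlite3. The table `setting`, the columns `EntityId` and `Value`, the ISO-8601 text dates and the `9999-12-31` "open end" sentinel are all assumptions for illustration. A `NOT EXISTS` subquery discards every valid row for which a valid row with a later `ValidFrom` exists for the same entity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE setting (
        EntityId  INTEGER NOT NULL,  -- logical record (assumption)
        Value     TEXT,
        ValidFrom TEXT NOT NULL,     -- inclusive
        ValidTo   TEXT NOT NULL      -- exclusive
    )
""")
conn.executemany("INSERT INTO setting VALUES (?, ?, ?, ?)", [
    (1, "old",    "2000-01-01", "2014-01-01"),
    (1, "v2014",  "2014-01-01", "2015-01-01"),
    (1, "future", "2015-01-01", "9999-12-31"),
    (1, "summer", "2014-06-01", "2014-09-01"),  # overlaps v2014 on purpose
    (2, "other",  "2013-01-01", "9999-12-31"),
])

AS_OF_SQL = """
SELECT s.EntityId, s.Value
FROM setting AS s
WHERE s.ValidFrom <= :time AND s.ValidTo > :time
  AND NOT EXISTS (              -- no other valid row with a later ValidFrom
      SELECT 1 FROM setting AS newer
      WHERE newer.EntityId = s.EntityId
        AND newer.ValidFrom <= :time AND newer.ValidTo > :time
        AND newer.ValidFrom > s.ValidFrom)
ORDER BY s.EntityId
"""

def as_of(conn, time):
    """Return one (EntityId, Value) row per entity that is valid at `time`."""
    return conn.execute(AS_OF_SQL, {"time": time}).fetchall()
```

For example, `as_of(conn, "2014-07-01")` picks `summer` over `v2014` for entity 1. Two versions of one entity with an identical `ValidFrom` would both be returned, so a tie-breaker (or a uniqueness constraint on `EntityId, ValidFrom`) would still be needed.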

Now I'm wondering how to implement an efficient solution to manage and query such data. Normally you can just write any SQL queries with any kind of WHERE, GROUP BY and JOIN to get the records you want. But with versioning applied, you need to consider the correct version of each record. So instead of joining every version of a record from another table, an appropriate condition must be added to only select the version that is valid at a given time.

An example:

SELECT a, b, c
FROM t1

Must be changed to:

SELECT a, b, c
FROM t1
WHERE t1.ValidFrom <= :time AND t1.ValidTo > :time
ORDER BY t1.ValidFrom DESC  -- highest ValidFrom wins among overlapping versions
LIMIT 1

More complex with a table join:

SELECT a, b, c
FROM t1
    LEFT JOIN t2 ON (t2.a = t1.a)

Must be changed to:

SELECT a, b, c
FROM t1
    LEFT JOIN t2 ON (t2.a = t1.a
        AND t2.ValidFrom <= :time AND t2.ValidTo > :time)  -- kept in ON so t1 rows without a valid t2 version survive
WHERE t1.ValidFrom <= :time AND t1.ValidTo > :time

This still doesn't handle selecting the right version of overlapping time spans. I could add some clean-up method that flattens out overlapping version time ranges, but I don't know how efficient that would be.
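One possible way to handle overlaps in joins without a clean-up pass is to wrap each table in a derived table that already contains only the applicable version per key, and then join those. The sketch below assumes ISO-8601 text dates, that column `a` is the logical key of both tables, and a hypothetical helper `as_of_subquery`; it is an illustration, not a tuned solution.

```python
import sqlite3

def as_of_subquery(table, key):
    """Build SQL for a derived table with, per key, the version valid at :time.

    `table` and `key` are interpolated into the SQL text, so they must be
    trusted identifiers, never user input.
    """
    return f"""(
        SELECT v.* FROM {table} AS v
        WHERE v.ValidFrom <= :time AND v.ValidTo > :time
          AND NOT EXISTS (
              SELECT 1 FROM {table} AS n
              WHERE n.{key} = v.{key}
                AND n.ValidFrom <= :time AND n.ValidTo > :time
                AND n.ValidFrom > v.ValidFrom))"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER, b TEXT, ValidFrom TEXT, ValidTo TEXT);
    CREATE TABLE t2 (a INTEGER, c TEXT, ValidFrom TEXT, ValidTo TEXT);
    INSERT INTO t1 VALUES (1, 'b-old', '2013-01-01', '2014-01-01'),
                          (1, 'b-new', '2014-01-01', '9999-12-31'),
                          (2, 'b-two', '2013-01-01', '9999-12-31');
    INSERT INTO t2 VALUES (1, 'c-base',     '2013-01-01', '9999-12-31'),
                          (1, 'c-override', '2014-06-01', '2014-09-01');
""")

# The original join, with each table replaced by its "as of :time" view.
query = f"""
SELECT t1.a, t1.b, t2.c
FROM {as_of_subquery('t1', 'a')} AS t1
    LEFT JOIN {as_of_subquery('t2', 'a')} AS t2 ON t2.a = t1.a
ORDER BY t1.a
"""

def rows_at(time):
    """Run the join as of the given time."""
    return conn.execute(query, {"time": time}).fetchall()
```

Because only the `FROM` and `JOIN` sources are rewritten, the rest of the query (`WHERE`, `GROUP BY`, further joins) can stay as it was, and the single `:time` parameter is shared by all tables. An index on the key plus `ValidFrom` would probably be needed for this to perform acceptably.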

I'm seeking to create a class (in C# in my case) that provides methods to read and write such versioned records. Writing is relatively easy because the queries are simple and easy to control with transactions, but querying would require building an API that accepts every fragment of an SQL SELECT query and intelligently builds the SQL query to execute from that. The query method should accept only one additional parameter that specifies the time to fetch the data for. Depending on each entity's validity range, a different version of each would be selected.

These are basically my incomplete thoughts about versioning data and providing an API to manage it. Have you already done such a thing and would like to tell me what you think of it? Do you have another idea that worked well? Could you offer me any advice on how to implement this API? While I theoretically know how to do it, I think it's a lot of work and I can't estimate how well it will work.

ygoe asked Jan 27 '14 14:01



2 Answers

If old data is part of your business logic:

  • Save the latest version in a master table (insert and update; a delete just changes the status column).
  • Take a snapshot in a detail table whenever an update happens (the snapshot is created before the update is applied).

[Image: revision history]

  • Another alternative is the Event Sourcing pattern.

If old data is just a trace log of changes:

  • An Entity–attribute–value approach may come in handy. An implementation sample can be found here.
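The master/detail snapshot idea from the first list could look roughly like this. It is a sketch using SQLite through Python's sqlite3 module; the names `product`, `product_history`, `SnapshotAt` and the trigger are assumptions for illustration, and other databases use different trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (              -- master table: latest version only
        Id     INTEGER PRIMARY KEY,
        Name   TEXT NOT NULL,
        Status TEXT NOT NULL DEFAULT 'active'  -- set to 'deleted' instead of deleting
    );
    CREATE TABLE product_history (      -- detail table: one snapshot per change
        ProductId  INTEGER NOT NULL REFERENCES product(Id),
        Name       TEXT NOT NULL,
        Status     TEXT NOT NULL,
        SnapshotAt TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    -- Copy the old values into the history table before each update.
    CREATE TRIGGER product_snapshot BEFORE UPDATE ON product
    BEGIN
        INSERT INTO product_history (ProductId, Name, Status)
        VALUES (OLD.Id, OLD.Name, OLD.Status);
    END;
""")

conn.execute("INSERT INTO product (Id, Name) VALUES (1, 'Widget')")
conn.execute("UPDATE product SET Name = 'Widget v2' WHERE Id = 1")
conn.execute("UPDATE product SET Status = 'deleted' WHERE Id = 1")  # soft delete

master = conn.execute("SELECT Name, Status FROM product").fetchall()
snapshots = conn.execute(
    "SELECT Name, Status FROM product_history ORDER BY rowid").fetchall()
```

With this split, ordinary queries and joins against `product` stay unchanged, and only history lookups touch `product_history`.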
Mohsen Heydari answered Oct 29 '22 15:10


I know that this is an old post, but I wanted to reply not only to provide a solution but also to exchange ideas with you and to discuss the most efficient solution to this important issue of versioning.

My idea is:

Create a table that contains 5 main versioning fields:

  • Serial (incremental number) is the real identifier and is used for joins
  • ID (self-referencing foreign key) is equal to the Serial value assigned when the record was first created
  • ValidFrom (date from which the record became active)
  • ValidTo (date at which the record became inactive); null for the current version
  • IsCurrent (flag indicating that the record is active)

When updating a record:

  • Update the existing row to set ValidTo to the current datetime (NOW) and IsCurrent to false.

  • Insert a new row with an incremented Serial and the very same ID as the updated record; ValidFrom is NOW, ValidTo is null and IsCurrent is true.

When deleting a record:

Set ValidTo to NOW and IsCurrent to false.

This way you will not have problems with joins, as joining tables on the ID field will show you the entire record history.

If you have FKs to a parent table, you probably want to remove the value of the FK field.
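The create, update and delete steps described in this answer can be sketched as follows, using Python's sqlite3 module. The table name `item`, the payload column `Name` and the helper functions are assumptions for illustration. The sketch also assumes that a newly inserted version is the one flagged as current, since `IsCurrent` is defined above as marking the active row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE item (
        Serial    INTEGER PRIMARY KEY AUTOINCREMENT,  -- real identifier of a version
        ID        INTEGER REFERENCES item(Serial),    -- Serial of the first version
        Name      TEXT,                               -- example payload (assumption)
        ValidFrom TEXT NOT NULL,
        ValidTo   TEXT,                               -- NULL while current
        IsCurrent INTEGER NOT NULL
    )
""")

def create(conn, name, now):
    """Insert the first version; its ID equals its own Serial."""
    with conn:
        cur = conn.execute(
            "INSERT INTO item (Name, ValidFrom, ValidTo, IsCurrent) "
            "VALUES (?, ?, NULL, 1)", (name, now))
        conn.execute("UPDATE item SET ID = Serial WHERE Serial = ?",
                     (cur.lastrowid,))
    return cur.lastrowid

def close_current(conn, record_id, now):
    """End the validity of the current version of record_id."""
    conn.execute(
        "UPDATE item SET ValidTo = ?, IsCurrent = 0 "
        "WHERE ID = ? AND IsCurrent = 1", (now, record_id))

def update(conn, record_id, name, now):
    """Close the current version and add a new one with the same ID."""
    with conn:  # both steps in one transaction
        close_current(conn, record_id, now)
        conn.execute(
            "INSERT INTO item (ID, Name, ValidFrom, ValidTo, IsCurrent) "
            "VALUES (?, ?, ?, NULL, 1)", (record_id, name, now))

def delete(conn, record_id, now):
    """Soft delete: close the current version without adding a new one."""
    with conn:
        close_current(conn, record_id, now)

rid = create(conn, "first", "2014-01-01")
update(conn, rid, "second", "2014-02-01")
history = conn.execute(
    "SELECT Serial, ID, Name, ValidTo, IsCurrent FROM item ORDER BY Serial"
).fetchall()
delete(conn, rid, "2014-03-01")
remaining = conn.execute(
    "SELECT COUNT(*) FROM item WHERE IsCurrent = 1").fetchone()[0]
```

Selecting `WHERE ID = ?` returns the whole history of a record, while adding `AND IsCurrent = 1` returns at most the active version.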

Kareem Saadeldeen answered Oct 29 '22 15:10