I'm looking for some input on the best way to design a data model that revolves around versioned data. There will be one-to-many and many-to-many relationships, all of which can change from version to version.
I'm looking for different strategies, with the ultimate goal being efficient comparisons and, if possible, storing only the delta.
This is actually a fairly difficult problem.
Versioning objects is easy. Versioning connections between them not so much - you'll have to make some design decisions. For example:
On top of that, most of the "supporting" tables will probably need to be "version aware" as well.
If I were you, I'd probably work my way from the following starting point:
The symbol between OBJECT and CONNECTION is "category" (a.k.a. inheritance, subclass, generalization hierarchy, etc.).
The basic idea behind this design is to support "snapshot", "restore" and "delta" functionality:
The querying would go something like this:
Let's say you have to put objects A, B and C, where A is parent for B and C:
    generation:    0
                   A0
                  / \
                B0  C0
Add new object D:
    generation:    0     1
                   A0
                 / | \
               B0  C0  D1
Modify A and C and delete B:
    generation:    0     1     2
                   A0
                   A2
                 / | \
               B0  C0  D1
               B2* C2
(*) OBJECT_VERSION.DELETED is true
Move C from A to D:
    generation:    0     1     2     3
                   A0
                   A2
                 / |* \
               B0  C0   D1
               B2* C2   |
                        C3
Etc...
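To make the walkthrough above concrete, here is a minimal in-memory sketch in Python. It is not the actual schema: it assumes each OBJECT_VERSION row carries an object id, a generation and a DELETED flag, and that each connection is an immutable (parent, child) pair that is itself an object. The ids such as "A-B" and the helper names `snapshot`, `edges` and `delta` are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectVersion:
    object_id: str         # which OBJECT this version belongs to
    generation: int        # generation in which this version was written
    deleted: bool = False  # stands in for OBJECT_VERSION.DELETED

# Assumed CONNECTION rows: immutable (parent, child) pairs that are also objects.
CONNECTIONS = {
    "A-B": ("A", "B"), "A-C": ("A", "C"),
    "A-D": ("A", "D"), "D-C": ("D", "C"),
}

# Only the rows that changed are written in each generation (0 to 3).
VERSIONS = [
    ObjectVersion("A", 0), ObjectVersion("B", 0), ObjectVersion("C", 0),
    ObjectVersion("A-B", 0), ObjectVersion("A-C", 0),
    ObjectVersion("D", 1), ObjectVersion("A-D", 1),
    ObjectVersion("A", 2), ObjectVersion("C", 2),
    ObjectVersion("B", 2, deleted=True),
    ObjectVersion("A-C", 3, deleted=True),
    ObjectVersion("D-C", 3), ObjectVersion("C", 3),
]

def snapshot(generation):
    """Map object_id -> latest non-deleted version at or before `generation`."""
    latest = {}
    for v in VERSIONS:
        if v.generation <= generation:
            cur = latest.get(v.object_id)
            if cur is None or v.generation > cur.generation:
                latest[v.object_id] = v
    return {oid: v for oid, v in latest.items() if not v.deleted}

def edges(generation):
    """Connections that are live and whose two endpoints are both live."""
    snap = snapshot(generation)
    return sorted(
        CONNECTIONS[cid] for cid in snap
        if cid in CONNECTIONS and all(end in snap for end in CONNECTIONS[cid])
    )

def delta(old_gen, new_gen):
    """Ids added, removed, or re-versioned between two generations."""
    old, new = snapshot(old_gen), snapshot(new_gen)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

With these rows, `edges(2)` no longer contains the A-B connection even though that connection was never marked deleted, because `edges` checks both endpoints. In a real database the same checks would be queries over the version table rather than Python loops.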
This design is open to anomalies with inconsistent deletions: the database won't defend itself from connecting a deleted and a non-deleted object, or from evolving one of the objects into a deleted state without also deleting the connection. You won't know whether a connection is valid until you examine both endpoints. If your data is hierarchical, you might employ a "reachability model" instead: an object is not deleted if it can be reached from some root object. You never directly delete the object - you just delete all connections to it. This can work well for hierarchies such as folders/files or similar, where you start from the "top" and search towards the bottom until you reach the desired object(s).
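A rough sketch of the reachability idea in Python, assuming you have already fetched the live (parent, child) connections for the generation you care about. The function name `reachable` and the sample data are hypothetical.

```python
from collections import deque

def reachable(root, edges):
    """Return the set of object ids reachable from `root` over (parent, child) edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    seen = {root}
    queue = deque([root])
    while queue:                      # plain breadth-first search
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Hypothetical live connections: the connection to B has been deleted,
# so B is unreachable from the root and is treated as deleted.
live_edges = [("A", "D"), ("D", "C")]
alive = reachable("A", live_edges)
```

Here `alive` contains A, C and D but not B, without any object ever carrying a DELETED flag of its own. The price is a traversal from the root every time you want to know whether an object still exists.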
An alternative to "immutable" connections is to inherit CONNECTION_VERSION from OBJECT_VERSION and place PARENT_ID/CHILD_ID there, using identifying relationships to ensure the diamond-shaped dependency is correctly modeled. This could be useful if you need to track the history of moves.
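As a loose illustration of that alternative, the Python sketch below puts the endpoints on the version row, so a move becomes a new version of the same connection. The field layout, the id "c-link" and the helpers `move_history` and `parent_at` are assumptions for illustration only, and the identifying relationships mentioned above are not modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ConnectionVersion:
    connection_id: str
    generation: int
    parent_id: str         # endpoints now live on the version row
    child_id: str
    deleted: bool = False

# Hypothetical rows: C hangs under A from generation 0 and moves to D in generation 3.
ROWS = [
    ConnectionVersion("c-link", 0, "A", "C"),
    ConnectionVersion("c-link", 3, "D", "C"),
]

def move_history(rows, connection_id):
    """(generation, parent_id) pairs for one connection, oldest first."""
    mine = [r for r in rows if r.connection_id == connection_id and not r.deleted]
    return [(r.generation, r.parent_id) for r in sorted(mine, key=lambda r: r.generation)]

def parent_at(rows, connection_id, generation) -> Optional[str]:
    """Parent recorded by the latest version at or before `generation`, if any."""
    mine = [r for r in rows
            if r.connection_id == connection_id and r.generation <= generation]
    if not mine:
        return None
    latest = max(mine, key=lambda r: r.generation)
    return None if latest.deleted else latest.parent_id
```

With immutable connections, the same move shows up as one connection being deleted and another being created, so you have to infer that the two are related. Here the history of the move is simply the version list of a single connection.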
These are just broad strokes of course, I hope you'll find your way...