Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Designing a data model for versioned data

I'm looking for some input on the best way to design a data model that revolves around versioned data. There will one-to-many and many-to-many relationships which can all change from version to version.

I'm looking for some different strategies with the ultimate goal being efficient comparisons and if possible only storing the delta.

like image 958
binarymelon Avatar asked Oct 19 '12 17:10

binarymelon


People also ask

What are the 4 different types of data models?

The three primary data model types are relational, dimensional, and entity-relationship (E-R). There are also several others that are not in general use, including hierarchical, network, object-oriented, and multi-value.

What are the 3 major components of a data model?

The most comprehensive definition of a data model comes from Edgar Codd (1980): A data model is composed of three components: 1) data structures, 2) operations on data structures, and 3) integrity constraints for operations and structures.


1 Answers

Intro

This is actually a fairly difficult problem.

Versioning objects is easy. Versioning connections between them not so much - you'll have to make some design decisions. For example:

  • Do you need to get the "snapshot" of the whole graph at any moment in history?
  • Do you want permanent deletions versus ability to restore deleted objects and connections?
  • Do you prefer speed (and don't mind copying the whole graph between versions) or space?

On top of that, most of the "supporting" tables will probably need to be "version aware" as well.

Design

If I were you, I'd probably work my way from the following starting point:

enter image description here

The symbol between OBJECT and CONNECTION is "category" (aka. inheritance, subclass, generalization hierarchy etc.).

The basic idea behind this design is to support "snapshot", "restore" and "delta" functionality:

  • Whole graph has a global version (aka. "generation") and we only store deltas between them.
  • Every object is versioned with that global generation (as opposed to local, object-specific versions).
  • Connections are objects, which makes them versioned as well.
  • Every time a set of objects enters the repository, a new GENERATION is inserted and:
    • An inserted object is inserted into OBJECT and OBJECT_VERSION.
    • A modified object is inserted into OBJECT_VERSION.
    • A deleted object is inserted into OBJECT_VERSION, with DELETED = true.
    • A restored object is inserted into OBJECT_VERSION, with DELETED = false. BTW, this enables delete/restore cycle to be repeated multiple times.
    • The rest of objects are untouched, so we don't waste space copying the unchanged data around.
  • A connection can't really be modified. To "move" a child objects to a new parent, delete the old connection (by setting DELETED as described above) and insert a new one. In fact deletion is the only kind of modification supported by a connection.

The querying would go something like this:

  • To get a single object, out of all its versions, pick the highest one that is still not higher than the desired generation. If this version's DELETED is true, object is not present in this generation.
  • To get the snapshot of the whole graph at the desired generation, do the above for all objects and create in-memory graph. Eliminate connections whose one or both endpoints are DELETED.
  • To get objects connected to a given object, recursively traverse CONNECTION, but cut the recursion as soon as you encounter an object that doesn't fulfill the criteria above.

Example

Let's say you have to put objects A, B and C, where A is parent for B and C:

generation: 0

      A0
     /  \
   B0    C0

Add new object D:

generation: 0 1

      A0
     / | \
   B0  C0 D1

Modify A and C and delete B:

generation: 0 1 2

      A0
      A2
     / | \
   B0  C0 D1
   B2* C2

   (*) OBJECT_VERSION.DELETED is true

Move C from A to D:

generation: 0 1 2 3

      A0
      A2
     / |* \
   B0  C0  D1
   B2* C2  |
           C3

Etc...

Some Musings

This design is open to anomalies with inconsistent deletions: the database won't defend itself from connecting a deleted and non-deleted object, or evolving one of the objects into a deleted state without also deleting the connection. You won't know whether a connection is valid until you examine both endpoints. If your data is hierarchical, you might employ a "reachability model" instead: object is not deleted if it can be reached from some root object. You never directly delete the object - you just delete all connections to it. This can work well for hierarchies such as folders/files or similar, where you start from the "top" and search towards the bottom until you reach the desired object(s).

Alternative to "immutable" connections is inheriting CONNECTION_VERSION from OBJECT_VERSION and placing PARENT_ID/CHILD_ID there, using identifying relationships to ensure the diamond-shaped dependency is correctly modeled. This could be useful if you need to track the history of moves.

These are just broad strokes of course, I hope you'll find your way...

like image 149
Branko Dimitrijevic Avatar answered Sep 20 '22 09:09

Branko Dimitrijevic