
design database relating to time attribute

I want to design a database described as follows: each product has exactly one status at any point in time. However, the status of a product can change during its lifetime. How should I design the relationship between product and status so that all products with a specific status at the current time can easily be queried? In addition, could anyone please give me some in-depth details about design database which related to time duration as problem above? Thanks for any help.

coolkid asked Nov 03 '10 01:11


3 Answers

Here is a model to achieve your stated requirement.

Link to Time Series Data Model

Link to IDEF1X Notation for those who are unfamiliar with the Relational Modelling Standard.

  • Normalised to 5NF; no duplicate columns; no Update Anomalies, no Nulls.

  • When the Status of a Product changes, simply insert a row into ProductStatus, with the current DateTime. No need to touch previous rows (which were true, and remain true). No dummy values which report tools (other than your app) have to interpret.

  • The DateTime is the actual DateTime that the Product was placed in that Status; the "From", if you will. The "To" is easily derived: it is the DateTime of the next (DateTime > "From") row for the Product; where it does not exist, the value is the current DateTime (use ISNULL).
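A minimal sketch of the points above, run against SQLite through Python's sqlite3 module so that it is executable as written. The column types, the sample rows, the `change_status` helper and the fixed `now` value are assumptions for illustration, not part of the linked model. SQLite has no two-argument ISNULL function, so the standard COALESCE stands in for it.

```python
import sqlite3

# Hypothetical rendering of the described tables; types and sample data are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (
    ProductId   INTEGER PRIMARY KEY,
    Description TEXT NOT NULL
);
CREATE TABLE Status (
    StatusCode INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL
);
CREATE TABLE ProductStatus (
    ProductId  INTEGER NOT NULL REFERENCES Product (ProductId),
    DateTime   TEXT    NOT NULL,   -- the "From"; no "To" column is stored
    StatusCode INTEGER NOT NULL REFERENCES Status (StatusCode),
    PRIMARY KEY (ProductId, DateTime)
);
""")
conn.execute("INSERT INTO Product VALUES (1, 'Widget')")
conn.executemany("INSERT INTO Status VALUES (?, ?)", [(1, "Draft"), (2, "Active")])

def change_status(product_id, status_code, when):
    """A status change is one INSERT; earlier rows are never updated."""
    conn.execute("INSERT INTO ProductStatus VALUES (?, ?, ?)",
                 (product_id, when, status_code))

change_status(1, 1, "2010-11-01 09:00:00")
change_status(1, 2, "2010-11-03 01:11:00")

now = "2010-11-05 00:00:00"  # fixed "current" time so the result is reproducible

# "To" is derived: the DateTime of the next row, else the current time.
history = conn.execute("""
    SELECT ps.StatusCode,
           ps.DateTime AS DateFrom,
           COALESCE((SELECT MIN(nxt.DateTime)
                       FROM ProductStatus nxt
                      WHERE nxt.ProductId = ps.ProductId
                        AND nxt.DateTime  > ps.DateTime), ?) AS DateTo
      FROM ProductStatus ps
     WHERE ps.ProductId = 1
     ORDER BY ps.DateTime
""", (now,)).fetchall()

for row in history:
    print(row)
```

The first row's DateTo equals the second row's DateFrom, and the open-ended last row falls back to `now`, without either value ever being stored.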

The first model is complete; (ProductId, DateTime) is enough to provide uniqueness, for the Primary Key. However, since you request speed for certain query conditions, we can enhance the model at the physical level, and provide:

  • An Index to support covered queries: those based on any arrangement of { ProductId | DateTime | Status } can be supplied by the Index, without having to go to the data rows. We already have the PK Index, so we will enhance that first, before adding a second index. This changes the Status::ProductStatus relation from Non-Identifying (broken line) to Identifying (solid line).

  • The PK arrangement is chosen on the basis that most queries will be Time Series, based on Product⇢DateTime⇢Status.

  • The second index is supplied to enhance the speed of queries based on Status.

  • In the Alternate Arrangement, that is reversed; ie, we mostly want the current status of all Products.

  • In all renditions of ProductStatus, the DateTime column in the secondary Index (not the PK) is DESCending; the most recent is first up.
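As a rough illustration of the physical-level options above, here is one way the enhanced PK and the secondary index might be declared, again using SQLite through Python so the sketch can be run. The index name `ProductStatus_ByStatus`, the exact column order of the secondary index, and the use of WITHOUT ROWID as a stand-in for a clustered index are all assumptions; the linked model may arrange them differently, and other platforms use different DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- PK widened to include StatusCode, so the PK index alone can cover queries.
-- WITHOUT ROWID stores the rows in PK order (SQLite's closest analogue
-- to a clustered index); this is an assumption of the sketch.
CREATE TABLE ProductStatus (
    ProductId  INTEGER NOT NULL,
    DateTime   TEXT    NOT NULL,
    StatusCode INTEGER NOT NULL,
    PRIMARY KEY (ProductId, DateTime, StatusCode)
) WITHOUT ROWID;

-- Hypothetical second index for Status-led queries; DateTime is DESCending
-- so the most recent row comes first.
CREATE INDEX ProductStatus_ByStatus
    ON ProductStatus (StatusCode, DateTime DESC, ProductId);
""")

# Ask the optimiser how it would answer a Status-led query.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT ProductId, DateTime
      FROM ProductStatus
     WHERE StatusCode = 2
""").fetchall()
detail = " ".join(row[3] for row in plan)
print(detail)
```

The printed plan text varies between SQLite versions, so check it on your own platform and with a data set of reasonable size, as suggested above.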

I have provided the discussion you requested. Of course, you need to experiment with a data set of reasonable size, and make your own decisions. If there is anything here that you do not understand, please ask, and I will expand.

Responses to Comments

Report all Products with Current State of 2

SELECT  p.ProductId,
        p.Description
    FROM  Product       p,
          ProductStatus ps
    WHERE p.ProductId   = ps.ProductId  -- Join
    AND   ps.StatusCode = 2             -- Request
    AND   ps.DateTime   = (             -- Current Status on the left ...
        SELECT MAX(ps_inner.DateTime)   -- Current Status row for outer Product
            FROM  ProductStatus ps_inner
            WHERE p.ProductId = ps_inner.ProductId
            )
  • ProductId is Indexed, leading col, both sides

  • DateTime is Indexed, 2nd col in Covered Query Option

  • StatusCode is Indexed, 3rd col in Covered Query Option

  • Since DateTime in the secondary Index is DESCending, only one fetch is required to satisfy the inner query

  • The rows are required at the same time, for the one query; they are close together (due to the Clustered Index), and almost always on the same page due to the short row size.
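To see the current-status query behave on data, here is a small runnable sketch using SQLite through Python. The three products, their status history and the `products_in_status` helper are made up for illustration; the query itself follows the shape of the one above, written with an explicit JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (
    ProductId   INTEGER PRIMARY KEY,
    Description TEXT NOT NULL
);
CREATE TABLE ProductStatus (
    ProductId  INTEGER NOT NULL REFERENCES Product (ProductId),
    DateTime   TEXT    NOT NULL,
    StatusCode INTEGER NOT NULL,
    PRIMARY KEY (ProductId, DateTime, StatusCode)
);
INSERT INTO Product VALUES (1, 'Widget'), (2, 'Gadget'), (3, 'Gizmo');
INSERT INTO ProductStatus VALUES
    (1, '2010-11-01 09:00:00', 1),
    (1, '2010-11-03 01:11:00', 2),   -- Widget: currently 2
    (2, '2010-11-01 10:00:00', 2),
    (2, '2010-11-04 12:00:00', 3),   -- Gadget: was 2, now 3
    (3, '2010-11-02 08:00:00', 2);   -- Gizmo: currently 2
""")

def products_in_status(status_code):
    """Products whose latest ProductStatus row carries the given StatusCode."""
    return conn.execute("""
        SELECT p.ProductId, p.Description
          FROM Product p
          JOIN ProductStatus ps ON ps.ProductId = p.ProductId
         WHERE ps.StatusCode = ?
           AND ps.DateTime = (SELECT MAX(ps_inner.DateTime)
                                FROM ProductStatus ps_inner
                               WHERE ps_inner.ProductId = p.ProductId)
         ORDER BY p.ProductId
    """, (status_code,)).fetchall()

current = products_in_status(2)
print(current)  # [(1, 'Widget'), (3, 'Gizmo')]
```

Gadget is excluded even though it has a StatusCode 2 row, because that row is not its latest.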

This is ordinary SQL, a subquery, using the power of the SQL engine, Relational set processing. It is the one correct method, there is nothing faster, and any other method would be slower. Any report tool will produce this code with a few clicks, no typing.

Two Dates in ProductStatus

Columns such as DateTimeFrom and DateTimeTo are gross errors. Let's take it in order of importance.

  1. It is a gross Normalisation error. "DateTimeTo" is easily derived from the single DateTime of the next row; it is therefore redundant, a duplicate column.

    • The precision does not come into it: that is easily resolved by virtue of the DataType (DATE, DATETIME, SMALLDATETIME). Whether you display one less second, microsecond, or nanosecond, is a business decision; it has nothing to do with the data that is stored.
  2. Implementing a DateTo column is a 100% duplicate (of DateTime of the next row). This takes twice the disk space. For a large table, that would be significant unnecessary waste.

  3. Given that it is a short row, you will need twice as many logical and physical I/Os to read the table, on every access.

  4. And twice as much cache space (or put another way, only half as many rows would fit into any given cache space).

  5. By introducing a duplicate column, you have introduced the possibility of error (the value can now be derived two ways: from the duplicate DateTimeTo column or the DateTimeFrom of the next row).

  6. This is also an Update Anomaly. When any DateTimeFrom is Updated, the DateTimeTo of the previous row has to be fetched (no big deal as it is close) and Updated (big deal as it is an additional verb that can be avoided).

  7. "Shorter" and "coding shortcuts" are irrelevant. SQL is a cumbersome data manipulation language, but SQL is all we have (Just Deal With It). Anyone who cannot code a subquery really should not be coding. Anyone who duplicates a column to ease minor coding "difficulty" really should not be modelling databases.

Note well that if the highest order rule (Normalisation) is maintained, the entire set of lower order problems is eliminated.

Think in Terms of Sets

  • Anyone having "difficulty" or experiencing "pain" when writing simple SQL is crippled in performing their job function. Typically the developer is not thinking in terms of sets, and the Relational Database is a set-oriented model.

  • For the query above, we need the Current DateTime; since ProductStatus is a set of Product States in chronological order, we simply need the latest, or MAX(DateTime) of the set belonging to the Product.

  • Now let's look at something allegedly "difficult", in terms of sets. For a report of the duration that each Product has been in a particular State: the DateTimeFrom is an available column, and defines the horizontal cut-off, a sub set (we can exclude earlier rows); the DateTimeTo is the earliest of the sub set of Product States.

SELECT               p.ProductId,
                     p.Description,
        [DateFrom] = ps.DateTime,
        [DateTo]   = (
        SELECT MIN(ps_inner.DateTime)               -- earliest in subset
            FROM  ProductStatus ps_inner
            WHERE p.ProductId = ps_inner.ProductId  -- our Product
            AND   ps_inner.DateTime > ps.DateTime   -- defines subset, cutoff
            )
    FROM  Product       p,
          ProductStatus ps
    WHERE p.ProductId   = ps.ProductId
    AND   ps.StatusCode = 2             -- Request
  • Thinking in terms of getting the next row is row-oriented, not set-oriented processing. Crippling, when working with a set-oriented database. Let the Optimiser do all that thinking for you. Check your SHOWPLAN, this optimises beautifully.

  • Inability to think in sets, thus being limited to writing only single-level queries, is not a reasonable justification for: implementing massive duplication and Update Anomalies in the database; wasting online resources and disk space; guaranteeing half the performance. Much cheaper to learn how to write simple SQL subqueries to obtain easily derived data.

PerformanceDBA answered Sep 30 '22 15:09


"In addition, could anyone please give me some in-depth details about design database which related to time duration as problem above?"

Well, there exists a 400-page book entitled "Temporal Data and the Relational Model" that addresses your problem.

That book also addresses numerous problems that the other responders have not addressed in their responses, for lack of time or for lack of space or for lack of knowledge.

The introduction of the book also explicitly states that "this book is not about technology that is (commercially) available to any user today."

All I can observe is that users wanting temporal features from SQL systems are, to put it plain and simple, left wanting.

PS

Even if those 400 pages could be "compressed a bit", I hope you don't expect me to give a summary of the entire meaningful content within a few paragraphs here on SO ...

Erwin Smout answered Sep 30 '22 16:09


Use tables similar to these:

product
-----------
product_id
status_id
name

status
-----------
status_id
name

product_history
---------------
product_id
status_id
status_time

Then write a trigger on product to record the status and timestamp (sysdate) on each update where the status changes.
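A minimal sketch of such a trigger, written for SQLite and driven from Python so it can be run as-is. The trigger name `product_status_change`, the column types and the sample rows are assumptions, and CURRENT_TIMESTAMP stands in for sysdate, which is Oracle-specific; trigger syntax differs on other platforms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE status (
    status_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    status_id  INTEGER NOT NULL REFERENCES status (status_id),
    name       TEXT NOT NULL
);
CREATE TABLE product_history (
    product_id  INTEGER NOT NULL REFERENCES product (product_id),
    status_id   INTEGER NOT NULL REFERENCES status (status_id),
    status_time TEXT    NOT NULL
);

-- Hypothetical trigger: log a row only when status_id actually changes.
CREATE TRIGGER product_status_change
AFTER UPDATE OF status_id ON product
FOR EACH ROW
WHEN NEW.status_id <> OLD.status_id
BEGIN
    INSERT INTO product_history (product_id, status_id, status_time)
    VALUES (NEW.product_id, NEW.status_id, CURRENT_TIMESTAMP);
END;

INSERT INTO status VALUES (1, 'Draft'), (2, 'Active');
INSERT INTO product VALUES (1, 1, 'Widget');
""")

conn.execute("UPDATE product SET status_id = 2 WHERE product_id = 1")       # logged
conn.execute("UPDATE product SET name = 'Widget v2' WHERE product_id = 1")  # not logged
conn.execute("UPDATE product SET status_id = 2 WHERE product_id = 1")       # unchanged, not logged

logged = conn.execute(
    "SELECT product_id, status_id FROM product_history").fetchall()
print(logged)  # [(1, 2)]
```

Note that an update-only trigger does not record the status a product is first inserted with; a matching AFTER INSERT trigger would be needed if the initial status should also appear in product_history.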

Randy answered Sep 30 '22 15:09