
Relational Database schema design for metric storage

Considering a system that has the following characteristics:

  • Stores time series data/metrics collected from multiple sensors/inputs.
  • Data points (metrics) are collected from many different systems at different times.
  • Each of these metrics is generally one data point (e.g. temp and humidity are not reported at the same time, but rather individually and will have a different timestamp)
  • The types of metrics that are collected will expand over time - the system is open and additional inputs will be supported over time (e.g. today we collect temp, humidity and cpu, tomorrow a sensor may be added that monitors co2 and RAM).
  • A summary of all metrics for a given time bucket needs to be obtained via a query, and it is likely to be the most common querying scenario.

I can think of three ways of modeling this.

1. Wide table - with table per category (covered)

Notes: Has lots of sparse values because the data points are collected individually. Storing a new metric requires a new column.
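
To make the sparseness concrete, here is a minimal sketch of option 1, run against an in-memory SQLite database from Python. The table and column names (environment_reading, sensor_id, recorded_at) are assumptions made for illustration; the original diagram is not available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical wide table: one column per metric in the category.
conn.execute("""
    CREATE TABLE environment_reading (
        sensor_id   INTEGER NOT NULL,
        recorded_at TEXT    NOT NULL,
        temp        REAL,   -- NULL unless this row is a temp reading
        humidity    REAL    -- NULL unless this row is a humidity reading
    )
""")

# Each metric arrives on its own, so every row fills only one metric column.
conn.executemany(
    "INSERT INTO environment_reading VALUES (?, ?, ?, ?)",
    [
        (1, "2019-02-22 09:00:05", 21.5, None),
        (1, "2019-02-22 09:00:40", None, 48.0),
        (1, "2019-02-22 09:01:10", 21.7, None),
    ],
)

# Supporting a new metric is a schema change.
conn.execute("ALTER TABLE environment_reading ADD COLUMN co2 REAL")

total, temps, hums, co2s = conn.execute(
    "SELECT COUNT(*), COUNT(temp), COUNT(humidity), COUNT(co2) "
    "FROM environment_reading"
).fetchone()
metric_cells = 3 * total                      # three metric columns per row
null_cells = metric_cells - (temps + hums + co2s)
print(f"{null_cells} of {metric_cells} metric cells are NULL")
```

With one value per row, the share of NULL cells grows with every metric column that is added.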


2. Narrow table - with table per metric (covered)

Notes: Storing a new metric requires a new table.
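
A rough sketch of option 2, again using in-memory SQLite from Python. The table names (temp_reading, humidity_reading, co2_reading) and columns are assumed for illustration only. It shows that a new metric needs a new table, and that a summary across metrics has to name every table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

COLUMNS = "(sensor_id INTEGER NOT NULL, recorded_at TEXT NOT NULL, value REAL NOT NULL)"

# Hypothetical per-metric tables: same shape, one table per metric.
for metric in ("temp", "humidity"):
    conn.execute(f"CREATE TABLE {metric}_reading {COLUMNS}")

conn.execute("INSERT INTO temp_reading VALUES (1, '2019-02-22 09:00:05', 21.5)")
conn.execute("INSERT INTO humidity_reading VALUES (1, '2019-02-22 09:00:40', 48.0)")

# A new metric means a new table ...
conn.execute(f"CREATE TABLE co2_reading {COLUMNS}")
conn.execute("INSERT INTO co2_reading VALUES (2, '2019-02-22 09:00:50', 415.0)")

# ... and every summary query has to be extended to mention it.
summary = dict(conn.execute("""
    SELECT 'temp' AS metric, AVG(value) FROM temp_reading
    UNION ALL
    SELECT 'humidity', AVG(value) FROM humidity_reading
    UNION ALL
    SELECT 'co2', AVG(value) FROM co2_reading
""").fetchall())
print(summary)
```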


3. Typed table - with single metric table (not covered)

Notes: Storing a new metric just requires a new row in the metricType table, with no schema changes. I am concerned about the performance implications due to chunk size, although grouping by a time bucket across all metrics would not require joins and could therefore be faster?
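
A minimal sketch of option 3 in in-memory SQLite, driven from Python. The names metric_type, metric, sensor_id and recorded_at are assumptions, as is the choice of an hourly bucket. The join here is used only to show the metric name; grouping on metric_type_id alone would need no join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metric_type (
        metric_type_id INTEGER PRIMARY KEY,
        name           TEXT NOT NULL UNIQUE
    );
    CREATE TABLE metric (
        metric_type_id INTEGER NOT NULL REFERENCES metric_type (metric_type_id),
        sensor_id      INTEGER NOT NULL,
        recorded_at    TEXT    NOT NULL,
        value          REAL    NOT NULL
    );
    INSERT INTO metric_type VALUES (1, 'temp'), (2, 'humidity');
""")

# Supporting a new metric is one row, not a schema change.
conn.execute("INSERT INTO metric_type VALUES (3, 'co2')")

conn.executemany(
    "INSERT INTO metric VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2019-02-22 09:05:00", 21.0),
        (1, 1, "2019-02-22 09:35:00", 23.0),
        (2, 1, "2019-02-22 09:10:00", 48.0),
        (3, 2, "2019-02-22 10:15:00", 415.0),
    ],
)

# Summary of all metrics per hourly time bucket.
rows = conn.execute("""
    SELECT strftime('%Y-%m-%d %H:00', m.recorded_at) AS bucket,
           t.name,
           AVG(m.value) AS avg_value
    FROM metric AS m
    JOIN metric_type AS t ON t.metric_type_id = m.metric_type_id
    GROUP BY bucket, t.name
    ORDER BY bucket, t.name
""").fetchall()
for row in rows:
    print(row)
```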


I was wondering if anyone could comment on the options presented, point me to some performance benchmarks that include 3 as well as 1 and 2, or generally give any advice on the suitability of each approach. I'm planning to run my own experiments on this and I will post the results when done, but any insight at this stage would be gratefully received. :)

Please note: do not suggest a NoSQL solution. I'm aware of the options in that space and am assessing them separately.

Sam Shiles asked Feb 22 '19 09:02


2 Answers

It depends largely on the types of query you'll need to run. I think performance may not be your biggest concern if, as you say

A summary of all metrics for a given time bucket needs to be obtained via a query, and it is likely to be the most common querying scenario.

As queries in all scenarios would hit an indexable timestamp column, it really is just a question of the performance of joins, and pretty much every relational database is really good at that.

If your queries really are just "show data for a time range", your option 3 (an entity/attribute/value design) is most effective from a development effort point of view.

Your query would have a single, inner join, and the timestamp column would provide a good index. As you say, you wouldn't need to change schema or queries when collecting new measurement points.

The alternative designs would require outer joins for each table. In performance terms, that's not a huge deal, but managing the schema and associated queries would be a pain.

However, if you also have to answer questions like "on what day was CPU above 30% while humidity was below 56% for more than 3 hours", your EAV model becomes really hard to work with. Those queries would rapidly become really hard to write and understand - every criterion becomes at least 1 self-join.
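
To illustrate the self-join point, here is a simplified sketch in in-memory SQLite from Python. It assumes a hypothetical single metric table keyed by metric name, and it only finds hourly buckets where average CPU was above 30 while average humidity was below 56. The "for more than 3 hours" part is deliberately left out, and would add further complexity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical "typed" table; metric_name stands in for the metricType key.
conn.execute("""
    CREATE TABLE metric (
        metric_name TEXT NOT NULL,
        recorded_at TEXT NOT NULL,
        value       REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO metric VALUES (?, ?, ?)",
    [
        ("cpu",      "2019-02-22 09:10:00", 35.0),
        ("cpu",      "2019-02-22 09:40:00", 45.0),
        ("humidity", "2019-02-22 09:20:00", 50.0),
        ("cpu",      "2019-02-22 10:10:00", 20.0),
        ("humidity", "2019-02-22 10:20:00", 50.0),
        ("cpu",      "2019-02-22 11:10:00", 50.0),
        ("humidity", "2019-02-22 11:30:00", 60.0),
    ],
)

# Each criterion needs its own pass over the same table, joined on the bucket.
buckets = conn.execute("""
    SELECT c.bucket
    FROM (SELECT strftime('%Y-%m-%d %H:00', recorded_at) AS bucket,
                 AVG(value) AS avg_value
          FROM metric
          WHERE metric_name = 'cpu'
          GROUP BY bucket) AS c
    JOIN (SELECT strftime('%Y-%m-%d %H:00', recorded_at) AS bucket,
                 AVG(value) AS avg_value
          FROM metric
          WHERE metric_name = 'humidity'
          GROUP BY bucket) AS h
      ON h.bucket = c.bucket
    WHERE c.avg_value > 30 AND h.avg_value < 56
    ORDER BY c.bucket
""").fetchall()
print(buckets)
```

A third criterion (say, RAM) would add a third derived table and another join condition.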

Neville Kuyt answered Jan 03 '23 17:01


1 Proposal

"Wide table"

That has gross Normalisation errors (and, if taken seriously, it has masses of Nulls and integrity problems). It is unusable; no further comment is required.

"Narrow table"

That is free of errors, but the Normalisation is not yet complete.

"Typed table"

That is sort of complete, the "best" of your three scenarios. But it views the issue through a narrow lens, and in total isolation from the context in which the issue exists. Thus it is in error for reasons other than those you inquire about.

2 Problem

  1. The first problem is that you are comparing three things which are not reasonably comparable, not reasonably equal to each other.

  2. The second problem is, EAV is the flavour of the month, and many people are attracted to it. However, it has major problems, and requires an additional set of "metadata" tables if it is to be implemented with some data integrity. The point is, EAV is not needed.

3 Solution

The types of metrics that are collected will expand over time - the system is open and additional inputs will be supported over time (e.g. today we collect temp, humidity and cpu, tomorrow a sensor may be added that monitors co2 and RAM).

This is actually a straight-forward Relational database problem, which is solved by a perfectly ordinary Relational design, which provides full Relational Power; Relational Integrity; and Speed (which other designs will not have).

3.1 Caveat

But there are a few caveats, due to the fact that what is marketed as "relational" is not Relational.

  1. Get rid of the Record ID fields, they are anti-Relational.

    • Record IDs reduce your schema to a 1970's style Record Filing system (located in an SQL container for convenience).
    • Record IDs do not provide row uniqueness, which is demanded by the Relational Model.
    • Further, they require one additional field and one additional index per file.
  2. When modelling a database (Relational or not), perceive the data, as data, and nothing but data. Do not view the data in terms of your need re the GUI, or some query or other.

  3. It is an error to concern yourself with performance issues at this (modelling) stage. First get it right. Second, make it fast. Do not reverse the prescribed sequence.

  4. Relational Keys provide meaning, as well as Relational Integrity (which is Logical, and distinct from Referential Integrity, which is a physical facility of SQL). What this addresses is the context in which an object exists.

    • A Sensor does not exist in isolation (except when it is in a package on a shelf in a shop ... but even then, it exists in the context of the shop inventory)
    • An active Sensor exists only in the context of the object in which it is housed. You have not provided any info regarding that. Let's call the thing Article as a generic label.
    • Further, it is the Article that requires a limit on the Metric that is being measured by the Sensor (for the purpose of out-of-range alarms, etc), and not the Sensor itself. (The Sensor may have a range, which is a different thing.)
    • Likewise, a Sensor exists in a Location, which is a second vector. Or else, the Article exists in a Location, and the Article Key carries the Location. I have modelled the latter.

3.2 Data Model

Here is the solution: Sensor Data Model

Inline graphics may not show up in some browsers. In that case, here it is in PDF.

  • It will satisfy both OLTP and OLAP (Dimension-Fact) requirements.

    • If you provide more context, we can get that modelled precisely. This may take a bit of to-and-fro.
  • It is limited to the info provided.

    • I have taken MetricType and SensorType to be synonymous.
    • Article is shown as Dependent on (exists within) Location, alternately they could be separate vectors. In any case, Article and Location together qualify Sensor.
    • Since SensorSerialNo is unique (AK2), therefore Reading(SensorSerialNo, DateTime) is unique. An index is not required. However, in the event there are many queries on Reading via SensorSerialNo alone, such an index will boost performance.
  • Please feel free to ask questions, and I will answer.

  • For those who are completely new to IDEF1X, refer to IDEF1X Introduction.

  • For those who are familiar with IDEF1X, and only want a brush-up, refer to IDEF1X Anatomy.
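
Since the linked diagram may not be visible, the following is a rough guess at what such a model could look like, written for in-memory SQLite and run from Python. It is not the author's DDL. The column names, the Value column, the sample data, and in particular the assumption of one Sensor per MetricType per Article are all mine, inferred from the bullets above. SQLite's WITHOUT ROWID is used so that no Record ID exists and each table is stored by its key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE Location (
        Location TEXT NOT NULL PRIMARY KEY,       -- short code used by users
        Name     TEXT NOT NULL UNIQUE             -- alternate key
    ) WITHOUT ROWID;

    CREATE TABLE Article (                        -- exists within a Location
        Location TEXT NOT NULL REFERENCES Location (Location),
        Article  TEXT NOT NULL,
        PRIMARY KEY (Location, Article)
    ) WITHOUT ROWID;

    CREATE TABLE MetricType (                     -- taken as synonymous with SensorType
        MetricType TEXT NOT NULL PRIMARY KEY
    ) WITHOUT ROWID;

    CREATE TABLE Sensor (                         -- qualified by Location + Article
        Location       TEXT NOT NULL,
        Article        TEXT NOT NULL,
        MetricType     TEXT NOT NULL REFERENCES MetricType (MetricType),
        SensorSerialNo TEXT NOT NULL UNIQUE,      -- alternate key
        PRIMARY KEY (Location, Article, MetricType),
        FOREIGN KEY (Location, Article) REFERENCES Article (Location, Article)
    ) WITHOUT ROWID;

    CREATE TABLE Reading (
        Location   TEXT NOT NULL,
        Article    TEXT NOT NULL,
        MetricType TEXT NOT NULL,
        DateTime   TEXT NOT NULL,
        Value      REAL NOT NULL,
        PRIMARY KEY (Location, Article, MetricType, DateTime),
        FOREIGN KEY (Location, Article, MetricType)
            REFERENCES Sensor (Location, Article, MetricType)
    ) WITHOUT ROWID;
""")

conn.execute("INSERT INTO Location VALUES ('SYD1', 'Sydney Data Centre 1')")
conn.execute("INSERT INTO Article VALUES ('SYD1', 'RACK07')")
conn.execute("INSERT INTO MetricType VALUES ('TEMP')")
conn.execute("INSERT INTO Sensor VALUES ('SYD1', 'RACK07', 'TEMP', 'SN-000123')")
conn.execute(
    "INSERT INTO Reading VALUES ('SYD1', 'RACK07', 'TEMP', '2019-02-22 09:00:05', 21.5)"
)


def rejected(sql: str) -> bool:
    """Return True if the statement violates a key or foreign key constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Same sensor, same instant: refused by the key itself, no extra index needed.
duplicate_rejected = rejected(
    "INSERT INTO Reading VALUES ('SYD1', 'RACK07', 'TEMP', '2019-02-22 09:00:05', 99.9)"
)
# A Reading for a Sensor that does not exist in that Article is refused.
orphan_rejected = rejected(
    "INSERT INTO Reading VALUES ('SYD1', 'RACK07', 'CO2', '2019-02-22 09:00:05', 415.0)"
)
print(duplicate_rejected, orphan_rejected)
```

In this sketch a new metric is a new MetricType row plus Sensor rows, with no schema change.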

4 Performance

Your concern re performance is good, but far too premature to be applied at this stage. First get the data model right, second get the data structures fast. The reasons for that are many, not the least of which is, when the data is Normalised, Relationally, the structures are already very fast. Further, one should never optimise for a particular query (one can add indices, if necessary, in the second stage).

Nevertheless, I will respond to your stated concerns.

  • Eg. a ClusteredIndex on the prescribed Reading PK will:
    • Serve most queries, most Dimensions (except queries that use SensorSerialNo alone, in case of which I have suggested an additional index)
    • Serve all OLTP Transactions and ensure the highest concurrency, because the Sensors are distributed per the real world: across Locations and Articles.
  • Whereas an Index on a Record ID guarantees a HotSpot on every single INSERT. Great for creating Deadlocks.
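
As a small, hedged illustration of the clustered-key point, the sketch below uses SQLite (via Python) as a stand-in, since a WITHOUT ROWID table there is stored in primary key order. The Reading columns are assumptions, not the author's DDL, and this says nothing about Sybase or about concurrency. It only shows that a typical "one sensor, one time range" query is answered from the key itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical Reading table; WITHOUT ROWID stores rows ordered by the key,
# which is the closest SQLite analogue of a clustered index on the PK.
conn.execute("""
    CREATE TABLE Reading (
        Location   TEXT NOT NULL,
        Article    TEXT NOT NULL,
        MetricType TEXT NOT NULL,
        DateTime   TEXT NOT NULL,
        Value      REAL NOT NULL,
        PRIMARY KEY (Location, Article, MetricType, DateTime)
    ) WITHOUT ROWID
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT DateTime, Value
    FROM Reading
    WHERE Location = 'SYD1' AND Article = 'RACK07' AND MetricType = 'TEMP'
      AND DateTime >= '2019-02-22 09:00' AND DateTime < '2019-02-22 10:00'
""").fetchall()

# The last column of each plan row is the human-readable description.
detail = plan[0][-1]
print(detail)
```

The exact plan wording varies between SQLite versions, but it should report a search on the primary key and not a full scan.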

4.1 Benchmark

I do have a hundred or so benchmarks for data structures such as this, collected over the last four decades for both OLTP & OLAP use. Most of my customers are banks (Think: Sensor Readings are very much like Stock Prices that change over the period of a day; several vectors (Dimensions); billions of rows). Banks are very strict about confidentiality, so I cannot publish the benchmarks as is, and redacting them will take time and effort.

I do have one benchmark for a very similar requirement, that is public. In fact, it was included in an Answer to a SO Question re Time Series data, but the seeker got the moderators to excise it (it is embarrassing to Oracle). Here is the Benchmark Summary for the Sybase ASE vs Oracle 10.2 benchmark on a fixed DDL (Time Series data) and population.

Finally, the structures and code required are simple enough for you to run your own benchmark.

5 Response to Other Answers

Re Neville's comments:

However, if you also have to answer questions like "on what day was CPU above 30% while humidity was below 56% for more than 3 hours", your EAV model becomes really hard to work with. Those queries would rapidly become really hard to write and understand - every criterion becomes at least 1 self-join.

His comments regard EAV, but they may be read as applying equally to the subject table in this case (Reading, an ordinary Relational, non-EAV table), because they concern the query type and not the EAV concept vs the Relational concept:

  • The declaration does not apply to Relational tables (it may well apply to EAV; the masses of problems introduced due to Record IDs; etc)

  • As long as you have

    1. a genuine Relational database schema (as I have suggested), and
    2. a genuine SQL platform (not a pretend "sql", which does not comply but fraudulently uses the name), and
    3. you understand IN and NOT IN, and how to compare Sets in SQL
  • ... such queries are straight-forward to code.
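
The answer gives no query, so the following is only one possible reading of "compare Sets with IN", sketched in in-memory SQLite from Python. The Reading columns and the data are assumptions. It is also simplified: it finds the hours, for one Article, in which some CPU reading was above 30 and some humidity reading was below 56, and it does not attempt the "for more than 3 hours" condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical Reading table keyed by context + metric + time (no Record ID).
conn.execute("""
    CREATE TABLE Reading (
        Location   TEXT NOT NULL,
        Article    TEXT NOT NULL,
        MetricType TEXT NOT NULL,
        DateTime   TEXT NOT NULL,
        Value      REAL NOT NULL,
        PRIMARY KEY (Location, Article, MetricType, DateTime)
    ) WITHOUT ROWID
""")
conn.executemany(
    "INSERT INTO Reading VALUES ('SYD1', 'RACK07', ?, ?, ?)",
    [
        ("CPU",      "2019-02-22 09:10:00", 35.0),
        ("HUMIDITY", "2019-02-22 09:20:00", 50.0),
        ("CPU",      "2019-02-22 10:10:00", 45.0),
        ("HUMIDITY", "2019-02-22 10:20:00", 60.0),
        ("CPU",      "2019-02-22 11:10:00", 20.0),
        ("HUMIDITY", "2019-02-22 11:20:00", 50.0),
    ],
)

# Set of hours with high CPU, restricted to the set of hours with low humidity.
hours = conn.execute("""
    SELECT DISTINCT strftime('%Y-%m-%d %H:00', DateTime) AS Hour
    FROM Reading
    WHERE Location = 'SYD1' AND Article = 'RACK07'
      AND MetricType = 'CPU' AND Value > 30
      AND strftime('%Y-%m-%d %H:00', DateTime) IN (
            SELECT strftime('%Y-%m-%d %H:00', DateTime)
            FROM Reading
            WHERE Location = 'SYD1' AND Article = 'RACK07'
              AND MetricType = 'HUMIDITY' AND Value < 56
          )
    ORDER BY Hour
""").fetchall()
print(hours)
```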

6 Response to Comments

Record ID is Anti-Relational

Do you have any links on the record_id being anti-relational? I don't disbelieve you for a second, but I'm interested to learn more about why this anti-pattern is so prevalent.

In this mess of anti-science, the academics manufacture and contrive various "solutions" to "problems", that do not exist in the Relational Model, and then you have a second level of endless "debates" about which correction to the non-problem is better or worse.

You don't need links because there is nothing to "debate", and whatever "debate" you might happen to read misses the above point.

The one and only authority is the great Dr E F Codd. All the authors of all books and textbooks alleging to be about the Relational Model, other than Codd, are actually false, they are about implementing 1970's style Record Filing Systems, and anti-Relational (no Relational Power; no Relational Integrity; no Relational Speed). They made the mistake, from 1970, of trying to fit the RM into their 1970's RFS mindset, rather than releasing it and taking on the RM mindset. And they have spent the last FIVE DECADES reinforcing that, even justifying it with "mathematical definitions"; 17 "relational algebras"; 42 abnormal "normal forms". All completely anti-Relational. And they cite each other, so they get published.

The second problem is, sites such as SO are predicated on the basis of populism. The popular answer is not the best or correct answer. For that you need an Authority (very scary to populists), and objective, absolute truth. (People love their relative or subjective "truths", that change all the time).

  1. Therefore, you need just the single, authoritative definition, the original paper, the Relational model.

    • Yes, the terms are out-dated, and not well understood these days.
    • Yes, it is seminal (every word counts, has deep meaning).
    • No, you need not read section 2 (math).
  2. You need to glean from that, that:

    • the Relational Key is “made up from the data” (my paraphrase, to the several entries, which are layers in the RM), which is Logical

    • that surrogates are (a) not only against that definition, (b) they are the pre-Relational paradigm, that is Physical pointers, the very thing the RM replaces, and (c) explicitly prohibited.

    • Very important, you need to understand not only the definition of the Relational Key, but the whys and the wherefores.

      • Eg. that it transcends import/export problems that pointer-based systems have.
      • Eg. the temporal definition (seminal; 8 letters; scary).
  3. Therefore, there is no argument, no "debate", to be had.

    • Anyone going against that is anti-Relational. Not because I say so, but because it contradicts evidenced facts, and the single Authority.
    • I have named the explicit technical benefits of using the RM correctly (Relational Power; Relational Integrity; Relational Speed), but an expansion of that requires a fair amount of effort
    • The consequence of NOT complying with the RM is, you get (a) none of the benefits, AND (b) you get the complete set of problems that pre-Relational Record Filing Systems had in 1970, AND (c) the contrived "solutions" supplied by the "academics" that have never worked.
  4. If you need an expansion of those benefits of the RM, which of course you do need to understand to some degree, because each one is very deep and very important, the best I can provide is this. As you can imagine, this is a battle that I have to fight on every Answer that relates to this subject, so I have posted a fair amount, over the years, across many Answers.

    • Go to my profile, select All Answers, and read any that relate to this subject.

Why is this Record ID anti-pattern so prevalent ?

The short answer is, people love their ignorance, their subjective "truths", and will fight tooth and nail to protect it. They quickly accept and repeat any justification for remaining the same. Learning something that is a paradigm shift away from what they know, is very scary, because it threatens their comfortable ignorance, and exposes it for what it really is. They will have to admit that what they have been writing for FIVE DECADES is wrong. That is why populism thrives. In ignorance.

The slightly longer answer is this. Just look at the internet. In the old days, for any particular subject, we had one source, one absolute authority: eg. buy the Encyclopædia Britannica; spend your entire childhood devouring it. Permanent truth. Honest history. But now anyone with a keyboard and two fingers plus some connective tissue (no brain required) can post. As an instant "authority". The web is chock-full of (a) superficial answers (the antithesis of "Now THAT is an answer") (b) in many flavours (c) that get upvoted due to populism (d) that are nowhere near the correct or full answer. Sound bites that can be easily understood by the populace. Very few want the depth of the full answer.

Even when an authority of sorts becomes established (eg. Wikipedia; Stack Overflow), it is easily subverted, because there are literally millions of people who change the entries (truth does not change, therefore, as long as something is changing, it is not truth). Mostly to serve their political positions; their ideologies; their re-writes of history to make the past wrong (it wasn't, it already happened), and the present insanity "good".

The definitive answer is this: academic envy. It took a whole decade for Codd's Relational Model to be understood and accepted. And even then, only by the few. IBM, and Britton-Lee (which became Sybase) implemented Codd's RM, in spirit and word. (Digital Equipment Corp did as well, but they are defunct.) Those academics who appeared to be working with Codd turned out to be actually working against him (by virtue of the evidence). They hated the fact that they did not come up with it themselves, that one man came up with the first real model, with a sound; logical; mathematical, foundation, complete with a Relational Algebra. All integrated. All requirements of the day (eg. the Bill Of Materials problem) answered. That has stood the test of time: five decades and nothing has been added or changed.

Typically they will declare, "but Codd did not define this or that, so here I am defining it ...". So they came up with their own RA. Now they have 17, all irrelevant. And abnormal "normal forms" to elevate fragmented bits of their Record Filing Systems to seem "relational". Now they have 42, all irrelevant. And many books, alleging to be "relational", but by evidenced fact, anti-Relational. Each "academic" seeks to reinforce their "academic" position, against all others.

Which is why I say, again, go to the one and only Authority. Read nothing from the anti-Relational crowd, because it will diminish your understanding of the RM (at best), or poison your mind (at worst).

One Clarification

If you examine a Relation PK (eg) Location.Location, it may seem odd. This is a %Code or %ShortName that is data, that the user actually uses. Usually 4 to 6 characters, max 12. As distinct from the long Name, which has to exist, and which is an Alternate Key. And of course, it is definitely not a number of any kind (which is not data, not something that the user uses). Users too, like their short forms. Obviously, use any International Standard if such exists.

The Key must be stable (not static, nothing in the universe is static), and one that is used in the real world to uniquely identify the object (data row).

  • Eg. for Security, which is a company listed on the stock exchange, in America, it would be TickerSymbol, in Australia ASXCode. The ISO code, an ISINCode, is an AlternateKey.

  • For cities, use one of the geographic location standards: ISO; FIPS; etc. (I use Statoids because it existed long before the others, but those days are numbered). At worst, use Airport Code.
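
A tiny sketch of that key choice in in-memory SQLite from Python: the short code the users actually use is the Primary Key, and the long Name is an Alternate Key. The 12-character limit comes from the text above; the CHECK constraint and the sample values are my own illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Location (
        Location TEXT NOT NULL PRIMARY KEY
                 CHECK (length(Location) BETWEEN 1 AND 12),  -- short code
        Name     TEXT NOT NULL UNIQUE                         -- alternate key
    ) WITHOUT ROWID
""")
conn.execute("INSERT INTO Location VALUES ('SYD', 'Sydney')")


def rejected(code: str, name: str) -> bool:
    """Return True if the row violates the primary or alternate key."""
    try:
        conn.execute("INSERT INTO Location VALUES (?, ?)", (code, name))
    except sqlite3.IntegrityError:
        return True
    return False


same_code = rejected("SYD", "Sydney West")   # duplicate short code
same_name = rejected("SYD2", "Sydney")       # duplicate long name
print(same_code, same_name)
```

Both keys are made up from the data, so both kinds of duplicate are refused without any numeric identifier.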


Genuine SQL Platform

What do you consider to be genuine SQL? Sql Server, Postgres, MySQL, Oracle I guess all would be?

No. I mean any platform that actually complies with the published SQL Standard, and therefore can actually support relational tables; relational processing of Sets; and ACID Transactions.

  • That automatically excludes freeware/vapourware/nowhere/"open source", for which bits are written by 10,000 developers spread across the universe, with no governing principles. Eg. no ACID Transactions, or the structures that are required for it, which are required in every code segment. Too late to insert that now, because it will require a 100% re-write, and heaven forbid ... a Server Architecture.

Commercial, which means paid-for and supported, is also important. Either you have a maintenance contract and support is immediate, or you post a bug report and you check for updates every day for the next year or three.

Server Architecture
If either scalability or performance (high throughput; high concurrency; low latency) is required, then the Server Architecture is most important. Again, that excludes the freeware, and Oracle, because they have no Architecture, they are massive collections of interacting programs, that get the o/s to perform all the functions that an architected Database Server would normally perform.

Check this Comparison of Oracle vs Sybase Architecture.

  • The exact same applies to PostgreSQL and other freeware. PostgreSQL (son of the total failure Ingres) famously failed under pressure, with masses of locking problems and very low concurrency.

1 High-End, Commercial, SQL Compliant

Something like 5% market share, but 95% of the Financial Services and Automation markets. Great Architecture, hopeless marketing.

  • Sybase ASE
  • IBM DB2

2 Commercial, SQL Compliant

  • MS SQL Server
    Easily the most common. Good Architecture (originally stolen from Sybase) and then "progressed" in the usual insane MS style. Pain to use; masses of overhead; poorly integrated with various add-ons and must-uses.

3 Commercial, SQL Non-Compliant

Hopeless Architecture, great marketing.

  • Oracle
    Generally, Oracle developers are quite good at using the product in the ways that are required to get it to work, but that means they have strayed quite far away from the Relational Model.

    • Eg. in the Time Series benchmark, the whole point was, Oracle cacks itself when a Subquery is requested, so it has to use an "Inline View". Which the OP alleged was just as fast as a Subquery (avoiding the fact that it requires far more code, and the coder must step outside the Relational mindset). Which the benchmark proved to be hilariously false, in each scenario tested (Oracle was 3 to 4.8 times slower than Sybase on a COUNT(), 26 to 36 times slower on a SUM(), and the Subquery (Sybase 2.1 secs) had to be abandoned after 120 mins).
    • Eg. Oracle is non-compliant re ACID Transactions, and developers work around that obstacle to a degree, but Phantom Updates and Lost Updates (technical terms) are simply not prevented. If the work-arounds are not written properly, entire rows (UPDATEs or INSERTs) are lost.
    • All that applies to the below ...

4 Non-Commercial, SQL Non-Compliant

These guys spend an awful lot of time developing "features" that are not required for a Relational database, but very attractive to the anti-Relational Record ID Filing Systems.

  • Eg. "deferred constraint checking"; ENUMs; etc.
  • They lack the basics of SQL compliance. Eg. no genuine ACID Transactions.

Further, as explained above, zero Architecture. This results in systems that perform wonderfully under single-use, and fail miserably under any order of pressure from concurrency or scalability.

Due to their non-compliance with the SQL requirement, they take pains to post a notice of compliance on every page in the Commands manual. (Just one declaration of compliance at the front of the manual is all that is required.) Of course, the missing commands are simply missing, so gee whiz, they do not have a compliance declaration.

  • PostgreSQL
    The worst piece of software I have ever had to examine since the days of Ingres. Dearly loved by the "academic" crowd, simply because it was scrawled by a fellow "academic".
    5 user max, or deal with the concurrency problems (just take a cursory look at the problems reported on SO).

  • MySQL
    Head and shoulders above PostgreSQL, but still in this category.

    • The InnoDB engine is distinctly better in the performance department, but nowhere near the Sybase/DB2 level (still no genuine Server Architecture). No respite in the SQL non-compliance department.

5 Summary

You get what you pay for.

  • Server Architecture, most visibly, performance in every scenario.
  • SQL Compliance, thought through deeply, and implemented in every applicable code segment.
  • Last but not least, Support.

Whatever you choose, remember, when you port it to another platform, your SQL code will require a complete check-and-change, because the "flavours" of SQL (or NON-sql) are very different. For the Non-Commercial program suites, that means a complete rewrite. Therefore choose carefully, with the long term implementation in mind.

PerformanceDBA answered Jan 03 '23 16:01