
Do relational databases provide a feasible backend for a process historian?

In the process industry, large amounts of data are read, often at high frequency, from many different data sources, such as NIR instruments as well as common instruments for pH, temperature, and pressure measurements. This data is typically stored in a process historian and retained for a long time.

Because of this, a process historian has different requirements than a typical relational database workload. Most queries against a process historian operate on a time stamp or a time range, together with a set of variables of interest.

The workload consists of frequent and numerous INSERTs, many SELECTs, few or no UPDATEs, and almost no DELETEs.

Q1. Are relational databases a good backend for a process historian?


A very naive implementation of a process historian in SQL could be something like this.

+------------------------------------------------+
| Variable                                       |
+------------------------------------------------+
| Id : integer primary key                       |
| Name : nvarchar(32)                            |
+------------------------------------------------+

+------------------------------------------------+
| Data                                           |
+------------------------------------------------+
| Id : integer primary key                       |
| Time : datetime                                |
| VariableId : integer foreign key (Variable.Id) |
| Value : float                                  |
+------------------------------------------------+
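
In concrete terms, a minimal T-SQL sketch of this naive schema might look like the following (the IDENTITY columns and NOT NULL constraints are assumptions added on top of the diagram):

    -- Naive historian schema, transcribed from the diagram above.
    CREATE TABLE Variable
    (
        Id   INT IDENTITY(1,1) PRIMARY KEY,
        Name NVARCHAR(32) NOT NULL
    );

    CREATE TABLE Data
    (
        Id         INT IDENTITY(1,1) PRIMARY KEY,
        Time       DATETIME NOT NULL,
        VariableId INT NOT NULL REFERENCES Variable(Id),
        Value      FLOAT NOT NULL
    );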

This structure is very simple, but probably slow for normal process historian operations, as it lacks "sufficient" indexes.
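
For example, a typical historian read against this schema asks for a handful of variables over a time range; without a composite index on (VariableId, Time), such a query has to scan the Data table. A sketch, with invented variable names and index name:

    -- Typical historian read: a set of variables over a time range.
    SELECT v.Name, d.Time, d.Value
    FROM Data AS d
    JOIN Variable AS v ON v.Id = d.VariableId
    WHERE v.Name IN (N'Temperature', N'Pressure')   -- hypothetical variable names
      AND d.Time >= '2010-01-01' AND d.Time < '2010-01-02'
    ORDER BY d.Time;

    -- A covering composite index that supports this access pattern.
    CREATE INDEX IX_Data_VariableId_Time
        ON Data (VariableId, Time)
        INCLUDE (Value);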

For example, if the Variable table consisted of 1,000 rows (a rather optimistic number), and data for all 1,000 variables were sampled once per minute (also an optimistic number), then the Data table would grow by 1,440,000 rows per day. Continuing the example, if each row takes roughly 16 bytes, that is about 23 megabytes per day, not counting additional space for indexes and other overhead.

23 megabytes per day perhaps isn't that much in itself, but keep in mind that the number of variables and the sample rate in the example were optimistic, and that the system will need to be operational 24/7, year-round.

Of course, archiving and compression come to mind.
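
For instance, assuming a SQL Server 2008 (or later) backend, archiving and compression could be sketched with date-based partitioning plus page compression; the boundary values, object names, and the DataArchive table below are purely illustrative:

    -- Partition the Data table by month so old months can be switched out
    -- into an archive table; real use requires appropriate filegroups.
    CREATE PARTITION FUNCTION pfDataByMonth (DATETIME)
        AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-02-01', '2010-03-01');

    CREATE PARTITION SCHEME psDataByMonth
        AS PARTITION pfDataByMonth ALL TO ([PRIMARY]);

    -- Compress the archived copy (a SQL Server 2008 Enterprise feature).
    ALTER TABLE DataArchive REBUILD WITH (DATA_COMPRESSION = PAGE);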

Q2. Is there a better way to accomplish this? Perhaps using some other table structure?

Asked by dalle

2 Answers

I work with a SQL Server 2008 database that has similar characteristics: heavy on insertion and selection, light on update/delete. About 100,000 "nodes", all sampling at least once per hour. And there's a twist: all of the incoming data for each "node" needs to be correlated against the history and used for validation, forecasting, etc. Oh, there's another twist: the data needs to be represented in 4 different ways, so there are essentially 4 different copies of this data, none of which can be derived from any of the other data with reasonable accuracy and within reasonable time. 23 megabytes would be a cakewalk; we're talking hundreds of gigabytes to terabytes here.

You'll learn a lot about scale in the process, about what techniques work and what don't, but modern SQL databases are definitely up to the task. This system that I just described? It's running on a 5-year-old IBM xSeries with 2 GB of RAM and a RAID 5 array, and it performs admirably; nobody has to wait more than a few seconds for even the most complex queries.

You'll need to optimize, of course. You'll need to denormalize frequently, and maintain pre-computed aggregates (or a data warehouse) if that's part of your reporting requirement. You might need to think outside the box a little: for example, we use a number of custom CLR types for raw data storage and CLR aggregates/functions for some of the more unusual transactional reports. SQL Server and other DB engines might not offer everything you need up-front, but you can work around their limitations.

You'll also want to cache - heavily. Maintain hourly, daily, weekly summaries. Invest in a front-end server with plenty of memory and cache as many reports as you can. This is in addition to whatever data warehousing solution you come up with if applicable.
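
As a rough illustration against the question's hypothetical schema (table names, column names, and granularity are assumptions, not the actual design described above), an hourly rollup could be maintained like this:

    -- Pre-computed hourly aggregates, refreshed periodically or maintained
    -- incrementally as new samples arrive.
    CREATE TABLE DataHourly
    (
        VariableId  INT      NOT NULL,
        HourStart   DATETIME NOT NULL,
        MinValue    FLOAT    NOT NULL,
        MaxValue    FLOAT    NOT NULL,
        AvgValue    FLOAT    NOT NULL,
        SampleCount INT      NOT NULL,
        PRIMARY KEY (VariableId, HourStart)
    );

    INSERT INTO DataHourly (VariableId, HourStart, MinValue, MaxValue, AvgValue, SampleCount)
    SELECT VariableId,
           DATEADD(HOUR, DATEDIFF(HOUR, 0, Time), 0),  -- truncate timestamp to the hour
           MIN(Value), MAX(Value), AVG(Value), COUNT(*)
    FROM Data
    WHERE Time >= '2010-01-01' AND Time < '2010-01-02'   -- illustrative range
    GROUP BY VariableId, DATEADD(HOUR, DATEDIFF(HOUR, 0, Time), 0);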

One of the things you'll probably want to get rid of is that "Id" key in your hypothetical Data table. My guess is that Data is a leaf table - it usually is in these scenarios - and this makes it one of the few situations where I'd recommend a natural key over a surrogate. The same variable probably can't generate duplicate rows for the same timestamp, so all you really need is the variable and timestamp as your primary key. As the table gets larger and larger, having a separate index on variable and timestamp (which of course needs to be covering) is going to waste enormous amounts of space - 20, 50, 100 GB, easily. And of course every INSERT now needs to update two or more indexes.
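
A minimal sketch of that suggestion, reusing the question's hypothetical Data table (exact types and constraint names are assumptions):

    -- Data keyed naturally by (VariableId, Time): the clustered primary key
    -- doubles as the index that time-range queries need, so no surrogate Id
    -- column or extra secondary index is required.
    CREATE TABLE Data
    (
        VariableId INT      NOT NULL REFERENCES Variable(Id),
        Time       DATETIME NOT NULL,
        Value      FLOAT    NOT NULL,
        CONSTRAINT PK_Data PRIMARY KEY CLUSTERED (VariableId, Time)
    );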

I really believe that an RDBMS (or SQL database, if you prefer) is as capable for this task as any other if you exercise sufficient care and planning in your design. If you just start slinging tables together without any regard for performance or scale, then of course you will get into trouble later, and when the database is several hundred GB it will be difficult to dig yourself out of that hole.

But is it feasible? Absolutely. Monitor the performance constantly and over time you will learn what optimizations you need to make.

Answered by Aaronaught

It sounds like you're talking about telemetry data (time stamps, data points).

We don't use SQL databases for this (although we do use SQL databases to organize it); instead, we use binary streaming files to capture the actual data. There are a number of binary file formats that are suitable for this, including HDF5 and CDF. The file format we use here is a proprietary compressible format. But then, we deal with hundreds of megabytes of telemetry data in one go.

You might find this article interesting (links directly to Microsoft Word document):
http://www.microsoft.com/caseStudies/ServeFileResource.aspx?4000003362

It is a case study from the McLaren group, describing how SQL Server 2008 is used to capture and process telemetry data from Formula One race cars. Note that they don't actually store the telemetry data in the database; instead, it is stored in the file system, and the FILESTREAM capability of SQL Server 2008 is used to access it.
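
As a rough sketch of that pattern (not the McLaren schema; all names here are invented, and the database is assumed to already have FILESTREAM enabled with a default FILESTREAM filegroup), queryable metadata could live in ordinary columns while the raw blob goes to the file system:

    -- Metadata lives in ordinary columns; the raw telemetry blob is stored
    -- on the file system via FILESTREAM but accessed through the table.
    CREATE TABLE TelemetryBlob
    (
        BlobId     UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        VariableId INT            NOT NULL,
        StartTime  DATETIME       NOT NULL,
        EndTime    DATETIME       NOT NULL,
        RawData    VARBINARY(MAX) FILESTREAM NULL
    );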

Answered by Robert Harvey