PostgreSQL: How to structure and index time-related data for optimal query performance?

The Problem:

I have time-related data in my database and I am struggling to structure and index that data in a way that lets users retrieve it efficiently; even simple database queries take longer than is acceptable.

Project Context:

While this is a pure database question, some context might help to understand the data model:

The project centers around doing research on a big, complex machine. I don't know a lot about the machine itself, but rumour in the lab has it there's a flux capacitor in there somewhere - and I think yesterday, I spotted the tail of Schrödinger's cat hanging out of it at the side ;-)

We measure many different parameters while the machine is running, using sensors positioned all over the machine at different measurement points (so-called spots), at certain intervals over a period of time. We use not just one device to measure these parameters but a whole range of them; they differ in the quality of their measurement data (this involves sample rates, sensor quality, price and many other aspects that I'm not concerned with); one aim of the project is in fact to establish a comparison between these devices. You can visualize these measurement devices as a bunch of lab trolleys, each with a lot of cables connected to the machine and each delivering measurement data.

The Data Model:

There is measurement data from every spot and every device for every parameter, for example once a minute over a period of 6 days. My job is to store that data in a database and to provide efficient access to it.

In a nutshell:

  • a device has a unique name
  • a parameter also has a name; they're not unique though, so it also has an ID
  • a spot has an ID

The project database is more complex of course, but these details don't seem relevant to the issue.

  • a measurement data index has an ID, a time stamp for when the measurement was done and references to the device and the spot on which the measurement was carried out
  • a measurement data value has a reference to the parameter and holds the value that was actually measured

Initially, I had modeled the measurement data value to have its own ID as primary key; the n:m relationship between measurement data index and value was a separate table that stored only index:value ID pairs. As that table itself consumed quite a lot of hard drive space, we eliminated it and changed the value ID to a plain integer that stores the ID of the measurement data index it belongs to; the primary key of the measurement data value is now composed of that ID and the parameter ID.

On a side note: When I created the data model, I carefully followed common design guidelines like 3NF and appropriate table constraints (such as unique keys); another rule of thumb was to create an index for every foreign key. I have a suspicion that the deviation in the measurement data index / value tables from 'strict' 3NF might be one of the reasons for the performance issues I am looking at now, but changing the data model back has not solved the problem.
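
For reference, here is a minimal sketch of what that original design might have looked like, reconstructed from the description above; it is an illustration only (table and constraint names are my assumptions), not the actual original code, and it is not meant to be run together with the script below:

-- Hypothetical reconstruction of the original design, for comparison only.
-- Every value row had its own surrogate primary key ...
CREATE TABLE measurement_data_value
(
  id              SERIAL,
  fk_parameter_id INTEGER NOT NULL,     -- references parameter (id)
  value           VARCHAR(16) NOT NULL,
  CONSTRAINT measurement_data_value_pk PRIMARY KEY (id)
);

-- ... and a separate n:m table mapped index rows to value rows;
-- this is the table that was later dropped to save disk space:
CREATE TABLE measurement_data_index_2_value
(
  fk_index_id INTEGER NOT NULL,         -- references measurement_data_index (id)
  fk_value_id INTEGER NOT NULL,         -- references measurement_data_value (id)
  CONSTRAINT measurement_data_index_2_value_pk PRIMARY KEY (fk_index_id, fk_value_id)
);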

The Data Model in DDL:

NOTE: There is an update to this code further below.

The script below creates the database and all tables involved. Please note that there are no explicit indexes yet. Before you run this, please make sure you don't happen to already have a database called so_test with any valuable data...

\c postgres
DROP DATABASE IF EXISTS so_test;
CREATE DATABASE so_test;
\c so_test

CREATE TABLE device
(
  name VARCHAR(16) NOT NULL,
  CONSTRAINT device_pk PRIMARY KEY (name)
);

CREATE TABLE parameter
(
  -- must have ID as names are not unique
  id SERIAL,
  name VARCHAR(64) NOT NULL,
  CONSTRAINT parameter_pk PRIMARY KEY (id)
);

CREATE TABLE spot
(
  id SERIAL,
  CONSTRAINT spot_pk PRIMARY KEY (id)
);

CREATE TABLE measurement_data_index
(
  id SERIAL,
  fk_device_name VARCHAR(16) NOT NULL,
  fk_spot_id INTEGER NOT NULL,
  t_stamp TIMESTAMP NOT NULL,
  CONSTRAINT measurement_pk PRIMARY KEY (id),
  CONSTRAINT measurement_data_index_fk_2_device FOREIGN KEY (fk_device_name)
    REFERENCES device (name) MATCH FULL
    ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT measurement_data_index_fk_2_spot FOREIGN KEY (fk_spot_id)
    REFERENCES spot (id) MATCH FULL
    ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT measurement_data_index_uk_all_cols UNIQUE (fk_device_name, fk_spot_id, t_stamp)
);

CREATE TABLE measurement_data_value
(
  id INTEGER NOT NULL,
  fk_parameter_id INTEGER NOT NULL,
  value VARCHAR(16) NOT NULL,
  CONSTRAINT measurement_data_value_pk PRIMARY KEY (id, fk_parameter_id),
  CONSTRAINT measurement_data_value_fk_2_parameter FOREIGN KEY (fk_parameter_id)
    REFERENCES parameter (id) MATCH FULL
    ON UPDATE NO ACTION ON DELETE NO ACTION
);

I have also created a script to fill the table with some test data:

CREATE OR REPLACE FUNCTION insert_data()
RETURNS VOID
LANGUAGE plpgsql
AS
$BODY$
  DECLARE
    t_stamp  TIMESTAMP := '2012-01-01 00:00:00';
    index_id INTEGER;
    param_id INTEGER;
    dev_name VARCHAR(16);
    value    VARCHAR(16);
  BEGIN
    FOR dev IN 1..5
    LOOP
      INSERT INTO device (name) VALUES ('dev_' || to_char(dev, 'FM00'));
    END LOOP;
    FOR param IN 1..20
    LOOP
      INSERT INTO parameter (name) VALUES ('param_' || to_char(param, 'FM00'));
    END LOOP;
    FOR spot IN 1..10
    LOOP
      INSERT INTO spot (id) VALUES (spot);
    END LOOP;

    WHILE t_stamp < '2012-01-07 00:00:00'
    LOOP
      FOR dev IN 1..5
      LOOP
        dev_name := 'dev_' || to_char(dev, 'FM00');
        FOR spot IN 1..10
        LOOP
          INSERT INTO measurement_data_index
            (fk_device_name, fk_spot_id, t_stamp)
            VALUES (dev_name, spot, t_stamp) RETURNING id INTO index_id;
          FOR param IN 1..20
          LOOP
            SELECT id INTO param_id FROM parameter
              WHERE name = 'param_' || to_char(param, 'FM00');
            value := 'd'  || to_char(dev,   'FM00')
                  || '_s' || to_char(spot,  'FM00')
                  || '_p' || to_char(param, 'FM00');
            INSERT INTO measurement_data_value (id, fk_parameter_id, value)
              VALUES (index_id, param_id, value);
          END LOOP;
        END LOOP;
      END LOOP;
      t_stamp := t_stamp + '1 minute'::INTERVAL;
    END LOOP;

  END;
$BODY$;

SELECT insert_data();

The PostgreSQL query planner requires up-to-date statistics, so analyze all tables. Vacuuming might not be required, but do it anyway:

VACUUM ANALYZE device;
VACUUM ANALYZE measurement_data_index;
VACUUM ANALYZE measurement_data_value;
VACUUM ANALYZE parameter;
VACUUM ANALYZE spot;

A Sample Query:

If I now run a really simple query to e.g. obtain all values for a certain parameter, it already takes a couple of seconds, although the database is not very large yet:

EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT measurement_data_value.value
  FROM measurement_data_value, parameter
 WHERE measurement_data_value.fk_parameter_id = parameter.id
   AND parameter.name = 'param_01';

Example result on my development machine (please see below for some details on my environment):

                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.26..178153.26 rows=432000 width=12) (actual time=0.046..2281.281 rows=432000 loops=1)
   Hash Cond: (measurement_data_value.fk_parameter_id = parameter.id)
   Buffers: shared hit=55035
   ->  Seq Scan on measurement_data_value  (cost=0.00..141432.00 rows=8640000 width=16) (actual time=0.004..963.999 rows=8640000 loops=1)
         Buffers: shared hit=55032
   ->  Hash  (cost=1.25..1.25 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         Buffers: shared hit=1
         ->  Seq Scan on parameter  (cost=0.00..1.25 rows=1 width=4) (actual time=0.004..0.008 rows=1 loops=1)
               Filter: ((name)::text = 'param_01'::text)
               Buffers: shared hit=1
 Total runtime: 2313.615 ms
(12 rows)

There are no indexes in the database apart from the implicit ones, so it's not surprising that the planner does sequential scans only. If I follow what seems to be a rule of thumb and add B-tree indexes for every foreign key, like

CREATE INDEX measurement_data_index_idx_fk_device_name
    ON measurement_data_index (fk_device_name);
CREATE INDEX measurement_data_index_idx_fk_spot_id
    ON measurement_data_index (fk_spot_id);
CREATE INDEX measurement_data_value_idx_fk_parameter_id
    ON measurement_data_value (fk_parameter_id);

and then do another VACUUM ANALYZE (just to be safe) and re-run the query, the planner uses bitmap heap and bitmap index scans, and the total query time improves somewhat:

                                                                                   QUERY PLAN                                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=8089.19..72842.42 rows=431999 width=12) (actual time=66.773..1336.517 rows=432000 loops=1)
   Buffers: shared hit=55033 read=1184
   ->  Seq Scan on parameter  (cost=0.00..1.25 rows=1 width=4) (actual time=0.005..0.012 rows=1 loops=1)
         Filter: ((name)::text = 'param_01'::text)
         Buffers: shared hit=1
   ->  Bitmap Heap Scan on measurement_data_value  (cost=8089.19..67441.18 rows=431999 width=16) (actual time=66.762..1237.488 rows=432000 loops=1)
         Recheck Cond: (fk_parameter_id = parameter.id)
         Buffers: shared hit=55032 read=1184
         ->  Bitmap Index Scan on measurement_data_value_idx_fk_parameter_id  (cost=0.00..7981.19 rows=431999 width=0) (actual time=65.222..65.222 rows=432000 loops=1)
               Index Cond: (fk_parameter_id = parameter.id)
               Buffers: shared read=1184
 Total runtime: 1371.716 ms
(12 rows)

However, this is still more than a second of execution time for a really simple query.

What I have done so far:

  • got myself a copy of PostgreSQL 9.0 High Performance - great book!
  • did some basic PostgreSQL server configuration, see environment below
  • created a framework to run a series of performance tests using real queries from the project and to display the results graphically; these queries take devices, spots, parameters and a time interval as input parameters, and the test series run over e.g. 5 or 10 devices, 5 or 10 spots, 5, 10, 15 or 20 parameters and 1..7 days. The basic result is that they're all too slow, but their query plans were way too complex for me to understand, so I went back to the really simple query used above.

I have looked into partitioning the value table. The data is time-related and partitioning seems an appropriate means to organize that kind of data; even the examples in the PostgreSQL documentation use something similar. However, I read in the same article:

The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.

The entire test database is less than 1GB in size and I am running my tests on a development machine with 8GB of RAM and on a virtual machine with 1GB (see also environment below), so the table is far from being very large or even exceeding the physical memory. I might implement partitioning anyway at some stage, but I have a feeling that approach does not target the performance problem itself.
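
For reference only, since I have not implemented it: in the schema above the time stamp lives in the index table, so a time-based partitioning sketch is easiest to show there. With PostgreSQL 9.1, partitioning means inheritance plus CHECK constraints and an insert trigger; the child table names, trigger name and monthly granularity below are assumptions:

-- Hypothetical 9.1-style partitioning sketch (monthly children); indexes and
-- constraints would have to be re-created on every child table.
CREATE TABLE measurement_data_index_y2012m01
  (CHECK (t_stamp >= '2012-01-01' AND t_stamp < '2012-02-01'))
  INHERITS (measurement_data_index);

CREATE TABLE measurement_data_index_y2012m02
  (CHECK (t_stamp >= '2012-02-01' AND t_stamp < '2012-03-01'))
  INHERITS (measurement_data_index);

-- Route inserts on the parent to the matching child:
CREATE OR REPLACE FUNCTION measurement_data_index_insert()
RETURNS TRIGGER
LANGUAGE plpgsql
AS
$BODY$
  BEGIN
    IF NEW.t_stamp >= '2012-01-01' AND NEW.t_stamp < '2012-02-01' THEN
      INSERT INTO measurement_data_index_y2012m01 VALUES (NEW.*);
    ELSIF NEW.t_stamp >= '2012-02-01' AND NEW.t_stamp < '2012-03-01' THEN
      INSERT INTO measurement_data_index_y2012m02 VALUES (NEW.*);
    ELSE
      RAISE EXCEPTION 't_stamp out of range: %', NEW.t_stamp;
    END IF;
    RETURN NULL;  -- row goes into the child table, not the parent
  END;
$BODY$;

CREATE TRIGGER measurement_data_index_insert_trg
  BEFORE INSERT ON measurement_data_index
  FOR EACH ROW EXECUTE PROCEDURE measurement_data_index_insert();

-- constraint_exclusion (default 'partition') lets the planner skip children
-- whose CHECK constraints exclude the queried time range.

One caveat: with the trigger returning NULL, an INSERT ... RETURNING on the parent (as used in insert_data() above) would no longer return the generated ID.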

Furthermore, I am considering clustering the value table. I dislike the fact that clustering must be re-done whenever new data is inserted and that it also requires an exclusive read/write lock, but looking at this SO question, it seems to have its benefits nonetheless and might be an option. However, clustering is done on a single index, and as up to 4 selection criteria go into a query (devices, spots, parameters and time), I would have to cluster on all of them - which in turn gives me the impression that I'm simply not creating the right indexes...
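
For illustration only, clustering on the parameter index created above would be a one-liner; note that CLUSTER takes an exclusive lock and has to be repeated after bulk inserts:

-- Physically reorder the value table along the parameter index:
CLUSTER measurement_data_value USING measurement_data_value_idx_fk_parameter_id;
ANALYZE measurement_data_value;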

My Environment:

  • development is taking place on a MacBook Pro (mid-2009) with a dual-core CPU and 8GB of RAM
  • I am running database performance tests on a virtual Debian 6.0 machine with 1GB of RAM, hosted on the MBP
  • PostgreSQL version is 9.1 as that was the latest version when I installed it, upgrading to 9.2 would be possible
  • I have changed shared_buffers from the default 1600kB to 25% of RAM on both machines as recommended in the PostgreSQL docs (which involved enlarging kernel settings like SHMALL, SHMMAX, etc.)
  • similarly, I have changed effective_cache_size from the default 128MB to 50% of the RAM available
  • I ran performance test with different work_mem settings, but did not see any major difference in performance

NOTE: One aspect that I believe is important is that the performance test series with real queries from the project do not differ performance-wise between the MacBook with 8GB and the virtual machine with 1GB; i.e. if a query takes 10s on the MacBook, it also takes 10s on the VM. Also, I ran the same performance tests before and after changing shared_buffers, effective_cache_size and work_mem, and the configuration changes did not improve performance by more than 10%; some results in fact even got worse, so it seems any difference is caused by test variation rather than by the configuration changes. These observations lead me to believe that RAM and postgresql.conf settings are not the limiting factors here yet.
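
For reference, the settings actually in effect on each machine can be checked from psql:

SHOW shared_buffers;
SHOW effective_cache_size;
SHOW work_mem;

-- or all at once:
SELECT name, setting, unit
  FROM pg_settings
 WHERE name IN ('shared_buffers', 'effective_cache_size', 'work_mem');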

My Questions:

I don't know if different or additional indexes would speed up the query and if they did, which ones to create. Looking at the size of the database and how simple my query is, I have the impression there is something fundamentally wrong about my data model or how I have chosen my indexes so far.

Does anyone have advice for me on how to structure and index time-related data to improve query performance?

Asked more broadly, is tuning query performance

  • usually done 'on an incident basis', i.e. once a query does not perform satisfactorily? It seems all my queries are too slow...
  • mainly a question of looking at (and understanding) query plans, then adding indexes and measuring whether things improved, possibly accelerating the process by applying one's experience?

How do I get this database to fly ?


Update 01:

Looking at the responses so far, I think I have not explained the need for measurement data index / values tables properly, so let me try again. Storage space is the issue here.

NOTE:

  • the figures used here are for illustrative purposes and meant for comparison only, i.e. the numbers themselves are not relevant; what matters is the percentage difference in storage requirements between using a single table vs. using an index and a value table
  • PostgreSQL data type storage sizes are documented in this chapter
  • this makes no claim to be scientifically correct, e.g. the units are probably mathematically bogus; the numbers should add up, though

Assuming

  • 1 day of measurements
  • 1 set of measurements per minute
  • 10 devices
  • 10 parameters
  • 10 spots

This adds up to

1 meas/min x 60 min/hour x 24 hour/day = 1440 meas/day

Each measurement has data from every spot and every device for every parameter, so

10 spots x 10 devices x 10 parameters = 1000 data sets/meas

So in total

1440 meas/day x 1000 data sets/meas = 1 440 000 data sets/day

If we store all measurements in a single table as Catcall suggested, e.g.

CREATE TABLE measurement_data
(
  device_name character varying(16) NOT NULL,
  spot_id integer NOT NULL,
  parameter_id integer NOT NULL,
  t_stamp timestamp without time zone NOT NULL,
  value character varying(16) NOT NULL,
  -- constraints...
);

a single row would add up to

17 + 4 + 4 + 8 + 17 = 50 bytes/row

in the worst case where all varchar fields are fully filled. This amounts to

50 bytes/row x 1 440 000 rows/day = 72 000 000 bytes/day

or ~69 MB per day.

While this does not sound like a lot, the storage space requirement in the real database would be prohibitive (again, the numbers used here are only for illustration). We have therefore split measurement data into an index and a value table as explained earlier in the question:

CREATE TABLE measurement_data_index
(
  id SERIAL,
  fk_device_name VARCHAR(16) NOT NULL,
  fk_spot_id INTEGER NOT NULL,
  t_stamp TIMESTAMP NOT NULL,
  -- constraints...
);

CREATE TABLE measurement_data_value
(
  id INTEGER NOT NULL,
  fk_parameter_id INTEGER NOT NULL,
  value VARCHAR(16) NOT NULL,
  -- constraints...
);

where the ID of a value row is equal to the ID of the index it belongs to.

The sizes of a row in the index and value tables are

index: 4 + 17 + 4 + 8 = 33 bytes
value: 4 + 4 + 17     = 25 bytes

(again, worst case scenario). The total number of rows is

index: 10 devices x 10 spots x 1440 meas/day =   144 000 rows/day
value: 10 parameters x 144 000 rows/day      = 1 440 000 rows/day

so the total is

index: 33 bytes/row x   144 000 rows/day =  4 752 000 bytes/day
value: 25 bytes/row x 1 440 000 rows/day = 36 000 000 bytes/day
total:                                   = 40 752 000 bytes/day

or ~39 MB per day - as opposed to ~69 MB for a single table solution.
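
Note that the per-row arithmetic above ignores PostgreSQL's per-tuple overhead (roughly 24 bytes of header plus alignment padding), so the actual on-disk sizes are larger in both cases; they can be checked directly against the test database like this:

-- Actual on-disk size of the two tables, including indexes and TOAST:
SELECT pg_size_pretty(pg_total_relation_size('measurement_data_index')) AS index_table,
       pg_size_pretty(pg_total_relation_size('measurement_data_value')) AS value_table;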


Update 02 (re: wildplasser's response):

This question is getting pretty long as it is, so I was considering updating the code in place in the original question above, but I think it might help to have both the first and the improved solutions in here to better see the differences.

Changes compared to the original approach (somewhat in order of importance):

  • swap timestamp and parameter, i.e. move the t_stamp field from the measurement_data_index table to measurement_data_value and move the fk_parameter_id field from the value to the index table: With this change, all fields in the index table are constant and new measurement data is written to the value table only. I did not expect any major query performance improvement from this (I was wrong), but I feel it makes the measurement data index concept clearer. While it requires fractionally more storage space (according to some rather crude estimate), having a 'static' index table might also help in deployment when tablespaces are moved to different hard drives according to their read/write requirements.
  • use a surrogate key in the device table: From what I understand, a surrogate key is a primary key that is not strictly required from a database design point of view (e.g. the device name is already unique, so it could also serve as the PK), but might help to improve query performance. I added it because, again, I feel it makes the concept clearer if the index table references IDs only (instead of some names and some IDs).
  • rewrite insert_data(): Use generate_series() instead of nested FOR loops; makes the code much 'snappier'.
  • As a side effect of these changes, inserting test data takes only about 50% of the time required by the first solution.
  • I did not add the view as wildplasser suggested; there's no backward compatibility required.
  • Additional indexes for the FKs in the index table seem to be ignored by the query planner and have no impact on query plan or performance.


\c postgres
DROP DATABASE IF EXISTS so_test_03;
CREATE DATABASE so_test_03;
\c so_test_03

CREATE TABLE device
(
  id SERIAL,
  name VARCHAR(16) NOT NULL,
  CONSTRAINT device_pk PRIMARY KEY (id),
  CONSTRAINT device_uk_name UNIQUE (name)
);

CREATE TABLE parameter
(
  id SERIAL,
  name VARCHAR(64) NOT NULL,
  CONSTRAINT parameter_pk PRIMARY KEY (id)
);

CREATE TABLE spot
(
  id SERIAL,
  name VARCHAR(16) NOT NULL,
  CONSTRAINT spot_pk PRIMARY KEY (id)
);

CREATE TABLE measurement_data_index
(
  id SERIAL,
  fk_device_id    INTEGER NOT NULL,
  fk_parameter_id INTEGER NOT NULL,
  fk_spot_id      INTEGER NOT NULL,
  CONSTRAINT measurement_pk PRIMARY KEY (id),
  CONSTRAINT measurement_data_index_fk_2_device FOREIGN KEY (fk_device_id)
    REFERENCES device (id) MATCH FULL
    ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT measurement_data_index_fk_2_parameter FOREIGN KEY (fk_parameter_id)
    REFERENCES parameter (id) MATCH FULL
    ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT measurement_data_index_fk_2_spot FOREIGN KEY (fk_spot_id)
    REFERENCES spot (id) MATCH FULL
    ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT measurement_data_index_uk_all_cols UNIQUE (fk_device_id, fk_parameter_id, fk_spot_id)
);

CREATE TABLE measurement_data_value
(
  id INTEGER NOT NULL,
  t_stamp TIMESTAMP NOT NULL,
  value VARCHAR(16) NOT NULL,
  -- NOTE: inverse field order compared to wildplassers version
  CONSTRAINT measurement_data_value_pk PRIMARY KEY (id, t_stamp),
  CONSTRAINT measurement_data_value_fk_2_index FOREIGN KEY (id)
    REFERENCES measurement_data_index (id) MATCH FULL
    ON UPDATE NO ACTION ON DELETE NO ACTION
);

CREATE OR REPLACE FUNCTION insert_data()
RETURNS VOID
LANGUAGE plpgsql
AS
$BODY$
  BEGIN
    INSERT INTO device (name)
    SELECT 'dev_' || to_char(item, 'FM00')
    FROM generate_series(1, 5) item;

    INSERT INTO parameter (name)
    SELECT 'param_' || to_char(item, 'FM00')
    FROM generate_series(1, 20) item;

    INSERT INTO spot (name)
    SELECT 'spot_' || to_char(item, 'FM00')
    FROM generate_series(1, 10) item;

    INSERT INTO measurement_data_index (fk_device_id, fk_parameter_id, fk_spot_id)
    SELECT device.id, parameter.id, spot.id
    FROM device, parameter, spot;

    INSERT INTO measurement_data_value(id, t_stamp, value)
    SELECT index.id,
           item,
           'd'  || to_char(index.fk_device_id,    'FM00') ||
           '_s' || to_char(index.fk_spot_id,      'FM00') ||
           '_p' || to_char(index.fk_parameter_id, 'FM00')
    FROM measurement_data_index index,
         generate_series('2012-01-01 00:00:00', '2012-01-06 23:59:59', interval '1 min') item;
  END;
$BODY$;

SELECT insert_data();

At some stage, I will change my own conventions to using inline PRIMARY KEY and REFERENCES statements instead of explicit CONSTRAINTs; for the moment, I think keeping this the way it was makes it easier to compare the two solutions.
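
For illustration, the device table above written with inline constraints would look roughly like this (PostgreSQL then auto-generates the constraint names, so they will differ from device_pk and device_uk_name):

CREATE TABLE device
(
  id   SERIAL PRIMARY KEY,
  name VARCHAR(16) NOT NULL UNIQUE
);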

Don't forget to update statistics for the query planner:

VACUUM ANALYZE device;
VACUUM ANALYZE measurement_data_index;
VACUUM ANALYZE measurement_data_value;
VACUUM ANALYZE parameter;
VACUUM ANALYZE spot;

Run a query that should produce the same result as the one in the first approach:

EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT measurement_data_value.value
  FROM measurement_data_index,
       measurement_data_value,
       parameter
 WHERE measurement_data_index.fk_parameter_id = parameter.id
   AND measurement_data_index.id = measurement_data_value.id
   AND parameter.name = 'param_01';

Result:

Nested Loop  (cost=0.00..34218.28 rows=431998 width=12) (actual time=0.026..696.349 rows=432000 loops=1)
  Buffers: shared hit=435332
  ->  Nested Loop  (cost=0.00..29.75 rows=50 width=4) (actual time=0.012..0.453 rows=50 loops=1)
        Join Filter: (measurement_data_index.fk_parameter_id = parameter.id)
        Buffers: shared hit=7
        ->  Seq Scan on parameter  (cost=0.00..1.25 rows=1 width=4) (actual time=0.005..0.010 rows=1 loops=1)
              Filter: ((name)::text = 'param_01'::text)
              Buffers: shared hit=1
        ->  Seq Scan on measurement_data_index  (cost=0.00..16.00 rows=1000 width=8) (actual time=0.003..0.187 rows=1000 loops=1)
              Buffers: shared hit=6
  ->  Index Scan using measurement_data_value_pk on measurement_data_value  (cost=0.00..575.77 rows=8640 width=16) (actual time=0.013..12.157 rows=8640 loops=50)
        Index Cond: (id = measurement_data_index.id)
        Buffers: shared hit=435325
Total runtime: 726.125 ms

This is almost half of the ~1.3s the first approach required; considering I'm loading 432K rows, it is a result I can live with for the moment.

NOTE: The field order in the value table PK is id, t_stamp; the order in wildplasser's response is t_stamp, whw_id. I did it that way because I feel a 'regular' field order is the one in which fields are listed in the table declaration (and 'reverse' is then the other way around), but that's just my own convention that keeps me from getting confused. Either way, as Erwin Brandstetter pointed out, this order is absolutely critical for the performance improvement; if it is the wrong way around (and a reverse index as in wildplasser's solution is missing - a sketch of such an index follows after the plan below), the query plan looks like below and performance is more than 3 times worse:

Hash Join  (cost=22.14..186671.54 rows=431998 width=12) (actual time=0.460..2570.941 rows=432000 loops=1)
  Hash Cond: (measurement_data_value.id = measurement_data_index.id)
  Buffers: shared hit=63537
  ->  Seq Scan on measurement_data_value  (cost=0.00..149929.58 rows=8639958 width=16) (actual time=0.004..1095.606 rows=8640000 loops=1)
        Buffers: shared hit=63530
  ->  Hash  (cost=21.51..21.51 rows=50 width=4) (actual time=0.446..0.446 rows=50 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 2kB
        Buffers: shared hit=7
        ->  Hash Join  (cost=1.26..21.51 rows=50 width=4) (actual time=0.015..0.359 rows=50 loops=1)
              Hash Cond: (measurement_data_index.fk_parameter_id = parameter.id)
              Buffers: shared hit=7
              ->  Seq Scan on measurement_data_index  (cost=0.00..16.00 rows=1000 width=8) (actual time=0.002..0.135 rows=1000 loops=1)
                    Buffers: shared hit=6
              ->  Hash  (cost=1.25..1.25 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 1kB
                    Buffers: shared hit=1
                    ->  Seq Scan on parameter  (cost=0.00..1.25 rows=1 width=4) (actual time=0.004..0.007 rows=1 loops=1)
                          Filter: ((name)::text = 'param_01'::text)
                          Buffers: shared hit=1
Total runtime: 2605.277 ms
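
For completeness, the 'reverse' index mentioned in the note above (the analogue of the extra index in wildplasser's answer) could be created like this; the index name is just my assumption:

-- Covers lookups by time range first, complementing the (id, t_stamp) PK:
CREATE INDEX measurement_data_value_idx_t_stamp_id
    ON measurement_data_value (t_stamp, id);
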
Asked by ssc on Sep 22 '12



1 Answer

I basically revised your whole setup. Tested under PostgreSQL 9.1.5.

DB schema

  • I think that your table layout has a major logical flaw (as also pointed out by @Catcall). I changed it the way I suspect it should be:
    Your last table measurement_data_value (which I renamed to measure_val) is supposed to save a value per parameter (now: param) for every row in measurement_data_index (now: measure). See below.

  • Even though "a device has a unique name" use an integer surrogate primary key anyway. Text strings are inherently bulkier and slower to be used as foreign keys in big tables. They are also subject to collation, which can slow down queries significantly.

    Under this related question we found that joining and sorting on a medium sized text column was the major slow-down. If you insist on using a text string as primary key, read up on collation support in PostgreSQL 9.1 or later.

  • Don't fall for the anti-pattern of using id as the name for a primary key. When you join a couple of tables (like you will have to do a lot!) you end up with several columns named id - what a mess! (Sadly, some ORMs use it.)

    Instead, name a surrogate primary key column after the table somehow to make it meaningful on its own. Then foreign keys referencing it can have the same name (that's a good thing, as they contain the same data).

    CREATE TABLE spot
    ( spot_id SERIAL PRIMARY KEY);
  • Don't use super-long identifiers. They are hard to type and hard to read. Rule of thumb: as long as necessary to be clear, as short as possible.

  • Don't use varchar(n) if you don't have a compelling reason. Just use varchar, or simpler: just text.

All this and more went into my proposal for a better db schema:

CREATE TABLE device
( device_id serial PRIMARY KEY 
 ,device text NOT NULL
);

CREATE TABLE param
( param_id serial PRIMARY KEY
 ,param text NOT NULL
);
CREATE INDEX param_param_idx ON param (param); -- you are looking up by name!

CREATE TABLE spot
( spot_id  serial PRIMARY KEY);

CREATE TABLE measure
( measure_id serial PRIMARY KEY
 ,device_id int NOT NULL REFERENCES device (device_id) ON UPDATE CASCADE
 ,spot_id int NOT NULL REFERENCES spot (spot_id) ON UPDATE CASCADE
 ,t_stamp timestamp NOT NULL
 ,CONSTRAINT measure_uni UNIQUE (device_id, spot_id, t_stamp)
);

CREATE TABLE measure_val   -- better name? 
( measure_id int NOT NULL REFERENCES measure (measure_id)
                 ON UPDATE CASCADE ON DELETE CASCADE  -- guessing it fits
 ,param_id int NOT NULL REFERENCES param (param_id)
                 ON UPDATE CASCADE ON DELETE CASCADE  -- guessing it fits
 ,value text NOT NULL
 ,CONSTRAINT measure_val_pk PRIMARY KEY (measure_id, param_id)
);
CREATE INDEX measure_val_param_id_idx ON measure_val (param_id);  -- !crucial!

I renamed the bulky measurement_data_value to measure_val, because that's what's in the table: parameter-values for measurements. Now, the multi-column pk makes sense, too.

But I added a separate index on param_id. The way you had it, column param_id was the second column in a multi-column index, which leads to poor results for param_id. Read all the gory details about that under this related question on dba.SE.

After implementing this alone, your query should be faster. But there is more you can do.

Test data

This fills in the data much faster. The point is that I use set-based DML commands, executing mass-inserts instead of loops that execute individual inserts, which takes forever. Makes quite a difference for the considerable amount of test data you want to insert. It's also much shorter and simpler.

To make it even more efficient, I use a data-modifying CTE (new in Postgres 9.1) that instantly reuses the massive amount of rows in the last step.

CREATE OR REPLACE FUNCTION insert_data()
RETURNS void LANGUAGE plpgsql AS
$BODY$
BEGIN
   INSERT INTO device (device)
   SELECT 'dev_' || to_char(g, 'FM00')
   FROM generate_series(1,5) g;

   INSERT INTO param (param)
   SELECT 'param_' || to_char(g, 'FM00')
   FROM generate_series(1,20) g;

   INSERT INTO spot (spot_id)
   SELECT nextval('spot_spot_id_seq'::regclass)
   FROM generate_series(1,10) g; -- to set sequence, too

   WITH x AS (
      INSERT INTO measure (device_id, spot_id, t_stamp)
      SELECT d.device_id, s.spot_id, g
      FROM   device    d
      CROSS  JOIN spot s
      CROSS  JOIN generate_series('2012-01-06 23:00:00' -- smaller set
                                 ,'2012-01-07 00:00:00' -- for quick tests
                                 ,interval '1 min') g
      RETURNING *
      )
   INSERT INTO measure_val (measure_id, param_id, value)
   SELECT x.measure_id
         ,p.param_id
         ,x.device_id || '_' || x.spot_id || '_' || p.param
   FROM  x
   CROSS JOIN param p;
END
$BODY$;

Call:

SELECT insert_data();

Query

  • Use explicit JOIN syntax and table aliases to make your queries easier to read and debug:
SELECT v.value
FROM   param p
JOIN   measure_val v USING (param_id)
WHERE  p.param = 'param_01';

The USING clause is just for simplifying the syntax, but not superior to ON otherwise.

This should be much faster now for two reasons:

  • Index param_param_idx on param.param.
  • Index measure_val_param_id_idx on measure_val.param_id, like explained in detail here.

Edit after feedback

My major oversight was that you already had added the crucial index in the form of measurement_data_value_idx_fk_parameter_id further down in your question. (I blame your cryptic names! :p ) On closer inspection, you have about 8.6M (6 * 24 * 60 * 5 * 10 * 20) rows in your test setup and your query retrieves 432K of them. I only tested with a much smaller subset.

Also, as you retrieve 5% of the whole table, indexes will only go so far. I was too optimistic; such an amount of data is bound to take some time. Is it a realistic requirement that you query 432K rows? I would assume you aggregate in your real-life application?

Further options

  • Partitioning.
  • More RAM and settings that make use of it.

    A virtual Debian 6.0 machine with 1GB of RAM

    is way below what you need.

  • Partial indexes, especially in connection with index-only scans of PostgreSQL 9.2.

  • Materialized views of aggregated data. Obviously, you are not going to display 432K rows, but some kind of aggregation. You can compute that once and save the results in a materialized view, from where you can retrieve data much faster (see the sketch after this list).
  • If your queries are predominantly by parameter (like the example), you could use CLUSTER to physically rewrite the table according to an index:

    CLUSTER measure_val USING measure_val_param_id_idx
    

    This way all rows for one parameter are stored in succession. That means fewer blocks to read and easier caching. It should make the query at hand much faster. Or INSERT the rows in favorable order to begin with, to the same effect.
    Partitioning would mix well with CLUSTER, since you would not have to rewrite the whole (huge) table every time. As your data is obviously just inserted and not updated, a partition would stay "in order" after CLUSTER.

  • Generally, PostgreSQL 9.2 should be great for you as its improvements focus on performance with big data.
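
Regarding the materialized-view option above: PostgreSQL 9.1/9.2 have no built-in MATERIALIZED VIEW (that arrived in 9.3), but the same effect can be had with a plain table that is re-filled periodically. A minimal sketch, assuming a per-parameter daily count is the kind of aggregate needed (table name and aggregate are assumptions, adapt to the real queries):

-- Manually maintained "materialized view" of daily aggregates:
CREATE TABLE measure_val_daily AS
SELECT v.param_id
      ,m.device_id
      ,m.spot_id
      ,date_trunc('day', m.t_stamp) AS day
      ,count(*)                     AS n_values
FROM   measure_val v
JOIN   measure     m USING (measure_id)
GROUP  BY 1, 2, 3, 4;

-- Refresh by truncating and re-filling, or append only the most recent day:
-- TRUNCATE measure_val_daily;
-- INSERT INTO measure_val_daily SELECT ...;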

Answered by Erwin Brandstetter