The Problem:
I have time-related data in my database and I am struggling to organize, structure and index that data in a way that lets users retrieve it efficiently; even simple database queries take longer than acceptable.
Project Context:
While this is a pure database question, some context might help to understand the data model:
The project centers around doing research on a big, complex machine. I don't know a lot about the machine itself, but rumour in the lab has it there's a flux capacitor in there somewhere - and I think yesterday, I spotted the tail of Schrödinger's cat hanging out of it at the side ;-)
We measure many different parameters while the machine is running using sensors positioned all over the machine at different measurement points (so-called spots) at certain intervals over a period of time. We use not only one device to measure these parameters, but a whole range of them; they differ in the quality of their measurement data (I think this involves sample rates, sensor quality, price and many other aspects that I'm not concerned with); one aim of the project actually is to establish a comparison between these devices. You can visualize these measurement devices as a bunch of lab trolleys, each with a lot of cables connected to the machine, each delivering measurement data.
The Data Model:
There is measurement data from every spot and every device for every parameter, for example once a minute over a period of 6 days. My job is to store that data in a database and to provide efficient access to it.
In a nutshell:
- device: a measurement device, identified by a unique name
- parameter: a measured parameter; it needs an ID because parameter names are not unique
- spot: a measurement point on the machine
- measurement_data_index: one entry per device, spot and timestamp
- measurement_data_value: one measured value per parameter, linked to an index entry
The project database is more complex of course, but these details don't seem relevant to the issue.
Initially, I had modeled the measurement data value to have its own ID as primary key; the n:m relationship between measurement data index and value was a separate table that only stored index:value ID pairs. As that table itself consumed quite a lot of hard drive space, we eliminated it and changed the value ID to a simple integer that stores the ID of the measurement data index it belongs to; the primary key of the measurement data value is now composed of that ID and the parameter ID.
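For illustration, the eliminated intermediate table looked roughly like this (reconstructed from the description above; the table and column names are my own, not the originals):
-- hypothetical reconstruction of the dropped n:m link table;
-- it only stored index ID : value ID pairs
CREATE TABLE measurement_data_link
(
  fk_index_id INTEGER NOT NULL,
  fk_value_id INTEGER NOT NULL,
  CONSTRAINT measurement_data_link_pk PRIMARY KEY (fk_index_id, fk_value_id)
);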
On a side note: When I created the data model, I carefully followed common design guidelines like 3NF and appropriate table constraints (such as unique keys); another rule of thumb was to create an index for every foreign key. I have a suspicion that the deviation in the measurement data index / value tables from 'strict' 3NF might be one of the reasons for the performance issues I am looking at now, but changing the data model back has not solved the problem.
The Data Model in DDL:
NOTE: There is an update to this code further below.
The script below creates the database and all tables involved. Please note that there are no explicit indexes yet. Before you run this, please make sure you don't happen to already have a database called so_test with any valuable data...
\c postgres
DROP DATABASE IF EXISTS so_test;
CREATE DATABASE so_test;
\c so_test
CREATE TABLE device
(
name VARCHAR(16) NOT NULL,
CONSTRAINT device_pk PRIMARY KEY (name)
);
CREATE TABLE parameter
(
-- must have ID as names are not unique
id SERIAL,
name VARCHAR(64) NOT NULL,
CONSTRAINT parameter_pk PRIMARY KEY (id)
);
CREATE TABLE spot
(
id SERIAL,
CONSTRAINT spot_pk PRIMARY KEY (id)
);
CREATE TABLE measurement_data_index
(
id SERIAL,
fk_device_name VARCHAR(16) NOT NULL,
fk_spot_id INTEGER NOT NULL,
t_stamp TIMESTAMP NOT NULL,
CONSTRAINT measurement_pk PRIMARY KEY (id),
CONSTRAINT measurement_data_index_fk_2_device FOREIGN KEY (fk_device_name)
REFERENCES device (name) MATCH FULL
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT measurement_data_index_fk_2_spot FOREIGN KEY (fk_spot_id)
REFERENCES spot (id) MATCH FULL
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT measurement_data_index_uk_all_cols UNIQUE (fk_device_name, fk_spot_id, t_stamp)
);
CREATE TABLE measurement_data_value
(
id INTEGER NOT NULL,
fk_parameter_id INTEGER NOT NULL,
value VARCHAR(16) NOT NULL,
CONSTRAINT measurement_data_value_pk PRIMARY KEY (id, fk_parameter_id),
CONSTRAINT measurement_data_value_fk_2_parameter FOREIGN KEY (fk_parameter_id)
REFERENCES parameter (id) MATCH FULL
ON UPDATE NO ACTION ON DELETE NO ACTION
);
I have also created a script to fill the table with some test data:
CREATE OR REPLACE FUNCTION insert_data()
RETURNS VOID
LANGUAGE plpgsql
AS
$BODY$
DECLARE
t_stamp TIMESTAMP := '2012-01-01 00:00:00';
index_id INTEGER;
param_id INTEGER;
dev_name VARCHAR(16);
value VARCHAR(16);
BEGIN
FOR dev IN 1..5
LOOP
INSERT INTO device (name) VALUES ('dev_' || to_char(dev, 'FM00'));
END LOOP;
FOR param IN 1..20
LOOP
INSERT INTO parameter (name) VALUES ('param_' || to_char(param, 'FM00'));
END LOOP;
FOR spot IN 1..10
LOOP
INSERT INTO spot (id) VALUES (spot);
END LOOP;
WHILE t_stamp < '2012-01-07 00:00:00'
LOOP
FOR dev IN 1..5
LOOP
dev_name := 'dev_' || to_char(dev, 'FM00');
FOR spot IN 1..10
LOOP
INSERT INTO measurement_data_index
(fk_device_name, fk_spot_id, t_stamp)
VALUES (dev_name, spot, t_stamp) RETURNING id INTO index_id;
FOR param IN 1..20
LOOP
SELECT id INTO param_id FROM parameter
WHERE name = 'param_' || to_char(param, 'FM00');
value := 'd' || to_char(dev, 'FM00')
|| '_s' || to_char(spot, 'FM00')
|| '_p' || to_char(param, 'FM00');
INSERT INTO measurement_data_value (id, fk_parameter_id, value)
VALUES (index_id, param_id, value);
END LOOP;
END LOOP;
END LOOP;
t_stamp := t_stamp + '1 minute'::INTERVAL;
END LOOP;
END;
$BODY$;
SELECT insert_data();
The PostgreSQL query planner requires up to date statistics, so analyze all tables. Vacuuming might not be required, but do it anyway:
VACUUM ANALYZE device;
VACUUM ANALYZE measurement_data_index;
VACUUM ANALYZE measurement_data_value;
VACUUM ANALYZE parameter;
VACUUM ANALYZE spot;
A Sample Query:
If I now run a really simple query to e.g. obtain all values for a certain parameter, it already takes a couple of seconds, although the database is not very large yet:
EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT measurement_data_value.value
FROM measurement_data_value, parameter
WHERE measurement_data_value.fk_parameter_id = parameter.id
AND parameter.name = 'param_01';
Exemplary result on my development machine (please see below for some details on my environment):
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1.26..178153.26 rows=432000 width=12) (actual time=0.046..2281.281 rows=432000 loops=1)
Hash Cond: (measurement_data_value.fk_parameter_id = parameter.id)
Buffers: shared hit=55035
-> Seq Scan on measurement_data_value (cost=0.00..141432.00 rows=8640000 width=16) (actual time=0.004..963.999 rows=8640000 loops=1)
Buffers: shared hit=55032
-> Hash (cost=1.25..1.25 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
Buffers: shared hit=1
-> Seq Scan on parameter (cost=0.00..1.25 rows=1 width=4) (actual time=0.004..0.008 rows=1 loops=1)
Filter: ((name)::text = 'param_01'::text)
Buffers: shared hit=1
Total runtime: 2313.615 ms
(12 rows)
There are no indexes in the database apart from the implicit ones, so it's not surprising the planner does sequential scans only. If I follow what seems to be a rule of thumb and add btree indexes for every foreign key like
CREATE INDEX measurement_data_index_idx_fk_device_name
ON measurement_data_index (fk_device_name);
CREATE INDEX measurement_data_index_idx_fk_spot_id
ON measurement_data_index (fk_spot_id);
CREATE INDEX measurement_data_value_idx_fk_parameter_id
ON measurement_data_value (fk_parameter_id);
then do another vacuum analyze (just to be safe) and re-run the query, the planner uses bitmap heap and bitmap index scans and the total query time somewhat improves:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=8089.19..72842.42 rows=431999 width=12) (actual time=66.773..1336.517 rows=432000 loops=1)
Buffers: shared hit=55033 read=1184
-> Seq Scan on parameter (cost=0.00..1.25 rows=1 width=4) (actual time=0.005..0.012 rows=1 loops=1)
Filter: ((name)::text = 'param_01'::text)
Buffers: shared hit=1
-> Bitmap Heap Scan on measurement_data_value (cost=8089.19..67441.18 rows=431999 width=16) (actual time=66.762..1237.488 rows=432000 loops=1)
Recheck Cond: (fk_parameter_id = parameter.id)
Buffers: shared hit=55032 read=1184
-> Bitmap Index Scan on measurement_data_value_idx_fk_parameter_id (cost=0.00..7981.19 rows=431999 width=0) (actual time=65.222..65.222 rows=432000 loops=1)
Index Cond: (fk_parameter_id = parameter.id)
Buffers: shared read=1184
Total runtime: 1371.716 ms
(12 rows)
However, this is still more than a second of execution time for a really simple query.
What I have done so far:
I have looked into partitioning the value table. The data is time-related and partitioning seems an appropriate means to organize that kind of data; even the examples in the PostgreSQL documentation use something similar. However, I read in the same article:
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
The entire test database is less than 1GB in size and I am running my tests on a development machine with 8GB of RAM and on a virtual machine with 1GB (see also environment below), so the table is far from being very large or even exceeding the physical memory. I might implement partitioning anyway at some stage, but I have a feeling that approach does not target the performance problem itself.
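For reference, a minimal sketch of what inheritance-based partitioning (the mechanism available in PostgreSQL 9.x) could look like for the index table, which carries the timestamp in this model; the daily granularity and the child table name are assumptions for illustration only:
-- one child table per day; a trigger or the loading job routes rows here
CREATE TABLE measurement_data_index_2012_01_01
(
  CHECK (t_stamp >= TIMESTAMP '2012-01-01' AND t_stamp < TIMESTAMP '2012-01-02')
) INHERITS (measurement_data_index);
-- with constraint_exclusion = partition (the default), queries that filter
-- on t_stamp skip child tables whose CHECK constraint rules them out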
Furthermore, I am considering clustering the value table. I dislike the fact that clustering must be re-done whenever new data is inserted and that it requires an exclusive read/write lock, but looking at this SO question, it seems that it has its benefits anyway and might be an option. However, clustering is done on an index, and as there are up to 4 selection criteria going into a query (devices, spots, parameters and time), I would have to create clusters for all of them - which in turn gives me the impression that I'm simply not creating the right indexes...
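For completeness, clustering the value table on the parameter index created above would look like this; it physically rewrites the table in fk_parameter_id order and has to be repeated after bulk inserts:
CLUSTER measurement_data_value USING measurement_data_value_idx_fk_parameter_id;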
My Environment:
I run my tests on two machines: my development machine, a MacBook with 8GB of RAM, and a virtual Debian 6.0 machine with 1GB of RAM. On both machines I increased shared_buffers from the default 1600kB to 25% of RAM, as recommended in the PostgreSQL docs (which involved enlarging kernel settings like SHMALL, SHMMAX, etc.).
NOTE: One aspect that I believe is important is that the performance test series with real queries from the project do not differ performance-wise between the MacBook with 8GB and the virtual machine with 1GB; i.e. if a query takes 10s on the MacBook, it also takes 10s on the VM. Also, I ran the same performance tests before and after changing shared_buffers, effective_cache_size and work_mem, and the configuration changes did not improve performance by more than 10%; some results in fact even got worse, so it seems any difference is caused rather by test variation than by configuration change. These observations lead me to believe that RAM and postgresql.conf settings are not the limiting factors here yet.
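To verify that the changed settings are actually in effect in a session, a quick check from psql (nothing project-specific here):
SHOW shared_buffers;
SHOW effective_cache_size;
SHOW work_mem;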
My Questions:
I don't know if different or additional indexes would speed up the query and, if they did, which ones to create. Looking at the size of the database and how simple my query is, I have the impression there is something fundamentally wrong with my data model or with how I have chosen my indexes so far.
Does anyone have some advice for me on how to structure and index time-related data like mine to improve query performance?
Asked more broadly: how do I get this database to fly?
Update 01:
Looking at the responses so far, I think I have not explained the need for measurement data index / values tables properly, so let me try again. Storage space is the issue here.
NOTE: The numbers used here (and below) are only for illustration; the real project works with different figures.
Assuming we take one measurement per minute with 10 devices at 10 spots for 10 parameters, this adds up to
1 meas/min x 60 min/hour x 24 hour/day = 1440 meas/day
Each measurement has data from every spot and every device for every parameter, so
10 spots x 10 devices x 10 parameters = 1000 data sets/meas
So in total
1440 meas/day x 1000 data sets/meas = 1 440 000 data sets/day
If we store all measurements in a single table as Catcall suggested, e.g.
CREATE TABLE measurement_data
(
device_name character varying(16) NOT NULL,
spot_id integer NOT NULL,
parameter_id integer NOT NULL,
t_stamp timestamp without time zone NOT NULL,
value character varying(16) NOT NULL,
-- constraints...
);
a single row would add up to
17 + 4 + 4 + 8 + 17 = 50 bytes/row
in the worst case where all varchar fields are fully filled. This amounts to
50 bytes/row x 1 440 000 rows/day = 72 000 000 bytes/day
or ~69 MB per day.
While this does not sound a lot, the storage space requirement in the real database would be prohibitive (again, the numbers used here are only for illustration). We have therefore split measurement data into an index and a value table as explained earlier in the question:
CREATE TABLE measurement_data_index
(
id SERIAL,
fk_device_name VARCHAR(16) NOT NULL,
fk_spot_id INTEGER NOT NULL,
t_stamp TIMESTAMP NOT NULL,
-- constraints...
);
CREATE TABLE measurement_data_value
(
id INTEGER NOT NULL,
fk_parameter_id INTEGER NOT NULL,
value VARCHAR(16) NOT NULL,
-- constraints...
);
where the ID of a value row is equal to the ID of the index it belongs to.
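For clarity, reassembling a complete measurement row from the split tables is a join on that shared ID (a sketch against the tables defined above):
SELECT idx.fk_device_name
      ,idx.fk_spot_id
      ,idx.t_stamp
      ,val.fk_parameter_id
      ,val.value
FROM   measurement_data_index idx
JOIN   measurement_data_value val ON val.id = idx.id;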
The sizes of a row in the index and value tables are
index: 4 + 17 + 4 + 8 = 33 bytes
value: 4 + 4 + 17 = 25 bytes
(again, worst case scenario). The total amount of rows is
index: 10 devices x 10 spots x 1440 meas/day = 144 000 rows/day
value: 10 parameters x 144 000 rows/day = 1 440 000 rows/day
so the total is
index: 33 bytes/row x 144 000 rows/day = 4 752 000 bytes/day
value: 25 bytes/row x 1 440 000 rows/day = 36 000 000 bytes/day
total: 40 752 000 bytes/day
or ~39 MB per day - as opposed to ~69 MB for a single table solution.
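To compare these back-of-the-envelope figures with what actually ends up on disk, PostgreSQL's size functions can be used (a sketch; note that pg_total_relation_size() also counts indexes and TOAST data, so it reports more than the raw row estimates above):
SELECT pg_size_pretty(pg_total_relation_size('measurement_data_index')) AS index_table,
       pg_size_pretty(pg_total_relation_size('measurement_data_value')) AS value_table;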
Update 02 (re: wildplasser's response):
This question is getting pretty long as it is, so I was considering updating the code in place in the original question above, but I think it might help to have both the first and the improved solutions in here to better see the differences.
Changes compared to the original approach (somewhat in order of importance):
- Move the t_stamp field from the measurement_data_index table to measurement_data_value, and move the fk_parameter_id field from the value to the index table: With this change, all fields in the index table are constant and new measurement data is written to the value table only. I did not expect any major query performance improvement from this (I was wrong), but I feel it makes the measurement data index concept clearer. While it requires fractionally more storage space (according to some rather crude estimate), having a 'static' index table might also help in deployment when tablespaces are moved to different hard drives according to their read/write requirements.
- Rewrite insert_data(): Use generate_series() instead of nested FOR loops; it makes the code much 'snappier'.
\c postgres
DROP DATABASE IF EXISTS so_test_03;
CREATE DATABASE so_test_03;
\c so_test_03
CREATE TABLE device
(
id SERIAL,
name VARCHAR(16) NOT NULL,
CONSTRAINT device_pk PRIMARY KEY (id),
CONSTRAINT device_uk_name UNIQUE (name)
);
CREATE TABLE parameter
(
id SERIAL,
name VARCHAR(64) NOT NULL,
CONSTRAINT parameter_pk PRIMARY KEY (id)
);
CREATE TABLE spot
(
id SERIAL,
name VARCHAR(16) NOT NULL,
CONSTRAINT spot_pk PRIMARY KEY (id)
);
CREATE TABLE measurement_data_index
(
id SERIAL,
fk_device_id INTEGER NOT NULL,
fk_parameter_id INTEGER NOT NULL,
fk_spot_id INTEGER NOT NULL,
CONSTRAINT measurement_pk PRIMARY KEY (id),
CONSTRAINT measurement_data_index_fk_2_device FOREIGN KEY (fk_device_id)
REFERENCES device (id) MATCH FULL
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT measurement_data_index_fk_2_parameter FOREIGN KEY (fk_parameter_id)
REFERENCES parameter (id) MATCH FULL
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT measurement_data_index_fk_2_spot FOREIGN KEY (fk_spot_id)
REFERENCES spot (id) MATCH FULL
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT measurement_data_index_uk_all_cols UNIQUE (fk_device_id, fk_parameter_id, fk_spot_id)
);
CREATE TABLE measurement_data_value
(
id INTEGER NOT NULL,
t_stamp TIMESTAMP NOT NULL,
value VARCHAR(16) NOT NULL,
-- NOTE: inverse field order compared to wildplasser's version
CONSTRAINT measurement_data_value_pk PRIMARY KEY (id, t_stamp),
CONSTRAINT measurement_data_value_fk_2_index FOREIGN KEY (id)
REFERENCES measurement_data_index (id) MATCH FULL
ON UPDATE NO ACTION ON DELETE NO ACTION
);
CREATE OR REPLACE FUNCTION insert_data()
RETURNS VOID
LANGUAGE plpgsql
AS
$BODY$
BEGIN
INSERT INTO device (name)
SELECT 'dev_' || to_char(item, 'FM00')
FROM generate_series(1, 5) item;
INSERT INTO parameter (name)
SELECT 'param_' || to_char(item, 'FM00')
FROM generate_series(1, 20) item;
INSERT INTO spot (name)
SELECT 'spot_' || to_char(item, 'FM00')
FROM generate_series(1, 10) item;
INSERT INTO measurement_data_index (fk_device_id, fk_parameter_id, fk_spot_id)
SELECT device.id, parameter.id, spot.id
FROM device, parameter, spot;
INSERT INTO measurement_data_value(id, t_stamp, value)
SELECT index.id,
item,
'd' || to_char(index.fk_device_id, 'FM00') ||
'_s' || to_char(index.fk_spot_id, 'FM00') ||
'_p' || to_char(index.fk_parameter_id, 'FM00')
FROM measurement_data_index index,
generate_series('2012-01-01 00:00:00', '2012-01-06 23:59:59', interval '1 min') item;
END;
$BODY$;
SELECT insert_data();
At some stage, I will change my own conventions to using inline PRIMARY KEY and REFERENCES statements instead of explicit CONSTRAINTs; for the moment, I think keeping this the way it was makes it easier to compare the two solutions.
Don't forget to update statistics for the query planner:
VACUUM ANALYZE device;
VACUUM ANALYZE measurement_data_index;
VACUUM ANALYZE measurement_data_value;
VACUUM ANALYZE parameter;
VACUUM ANALYZE spot;
Run a query that should produce the same result as the one in the first approach:
EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT measurement_data_value.value
FROM measurement_data_index,
measurement_data_value,
parameter
WHERE measurement_data_index.fk_parameter_id = parameter.id
AND measurement_data_index.id = measurement_data_value.id
AND parameter.name = 'param_01';
Result:
Nested Loop (cost=0.00..34218.28 rows=431998 width=12) (actual time=0.026..696.349 rows=432000 loops=1)
Buffers: shared hit=435332
-> Nested Loop (cost=0.00..29.75 rows=50 width=4) (actual time=0.012..0.453 rows=50 loops=1)
Join Filter: (measurement_data_index.fk_parameter_id = parameter.id)
Buffers: shared hit=7
-> Seq Scan on parameter (cost=0.00..1.25 rows=1 width=4) (actual time=0.005..0.010 rows=1 loops=1)
Filter: ((name)::text = 'param_01'::text)
Buffers: shared hit=1
-> Seq Scan on measurement_data_index (cost=0.00..16.00 rows=1000 width=8) (actual time=0.003..0.187 rows=1000 loops=1)
Buffers: shared hit=6
-> Index Scan using measurement_data_value_pk on measurement_data_value (cost=0.00..575.77 rows=8640 width=16) (actual time=0.013..12.157 rows=8640 loops=50)
Index Cond: (id = measurement_data_index.id)
Buffers: shared hit=435325
Total runtime: 726.125 ms
This is almost half of the ~1.3s the first approach required; considering I'm loading 432K rows, it is a result I can live with for the moment.
NOTE: The field order in the value table PK is id, t_stamp; the order in wildplasser's response is t_stamp, whw_id. I did it that way because I feel a 'regular' field order is the one in which fields are listed in the table declaration (and 'reverse' is then the other way around), but that's just my own convention that keeps me from getting confused. Either way, as Erwin Brandstetter pointed out, this order is absolutely critical for the performance improvement; if it is the wrong way around (and a reverse index as in wildplasser's solution is missing), the query plan looks like below and performance is more than 3 times worse:
Hash Join (cost=22.14..186671.54 rows=431998 width=12) (actual time=0.460..2570.941 rows=432000 loops=1)
Hash Cond: (measurement_data_value.id = measurement_data_index.id)
Buffers: shared hit=63537
-> Seq Scan on measurement_data_value (cost=0.00..149929.58 rows=8639958 width=16) (actual time=0.004..1095.606 rows=8640000 loops=1)
Buffers: shared hit=63530
-> Hash (cost=21.51..21.51 rows=50 width=4) (actual time=0.446..0.446 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 2kB
Buffers: shared hit=7
-> Hash Join (cost=1.26..21.51 rows=50 width=4) (actual time=0.015..0.359 rows=50 loops=1)
Hash Cond: (measurement_data_index.fk_parameter_id = parameter.id)
Buffers: shared hit=7
-> Seq Scan on measurement_data_index (cost=0.00..16.00 rows=1000 width=8) (actual time=0.002..0.135 rows=1000 loops=1)
Buffers: shared hit=6
-> Hash (cost=1.25..1.25 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
Buffers: shared hit=1
-> Seq Scan on parameter (cost=0.00..1.25 rows=1 width=4) (actual time=0.004..0.007 rows=1 loops=1)
Filter: ((name)::text = 'param_01'::text)
Buffers: shared hit=1
Total runtime: 2605.277 ms
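For reference, the 'reverse index' that would rescue the reversed PK order is simply an index leading with id; a sketch with a name of my own choosing (redundant if the PK is already (id, t_stamp)):
CREATE INDEX measurement_data_value_idx_id_t_stamp
  ON measurement_data_value (id, t_stamp);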
An Answer:
I basically revised your whole setup. Tested under PostgreSQL 9.1.5.
I think that your table layout has a major logical flaw (as also pointed out by @Catcall). I changed it the way I suspect it should be:
Your last table measurement_data_value (which I renamed to measure_val) is supposed to save a value per parameter (now: param) for every row in measurement_data_index (now: measure). See below.
Even though "a device has a unique name", use an integer surrogate primary key anyway. Text strings are inherently bulkier and slower to use as foreign keys in big tables. They are also subject to collation, which can slow down queries significantly.
Under this related question we found that joining and sorting on a medium-sized text column was the major slow-down.
If you insist on using a text string as primary key, read up on collation support in PostgreSQL 9.1 or later.
Don't fall for the anti-pattern of using id as the name for a primary key. When you join a couple of tables (like you will have to do a lot!), you end up with several columns named id - what a mess! (Sadly, some ORMs do this.)
Instead, name a surrogate primary key column after the table somehow, to make it meaningful on its own. Then the foreign keys referencing it can have the same name (that's good, as they contain the same data):
CREATE TABLE spot
( spot_id SERIAL PRIMARY KEY);
Don't use super-long identifiers. They are hard to type and hard to read. Rule of thumb: as long as necessary to be clear, as short as possible.
Don't use varchar(n) if you don't have a compelling reason. Just use varchar, or simpler: just text.
All this and more went into my proposal for a better db schema:
CREATE TABLE device
( device_id serial PRIMARY KEY
,device text NOT NULL
);
CREATE TABLE param
( param_id serial PRIMARY KEY
,param text NOT NULL
);
CREATE INDEX param_param_idx ON param (param); -- you are looking up by name!
CREATE TABLE spot
( spot_id serial PRIMARY KEY);
CREATE TABLE measure
( measure_id serial PRIMARY KEY
,device_id int NOT NULL REFERENCES device (device_id) ON UPDATE CASCADE
,spot_id int NOT NULL REFERENCES spot (spot_id) ON UPDATE CASCADE
,t_stamp timestamp NOT NULL
,CONSTRAINT measure_uni UNIQUE (device_id, spot_id, t_stamp)
);
CREATE TABLE measure_val -- better name?
( measure_id int NOT NULL REFERENCES measure (measure_id)
ON UPDATE CASCADE ON DELETE CASCADE -- guessing it fits
,param_id int NOT NULL REFERENCES param (param_id)
ON UPDATE CASCADE ON DELETE CASCADE -- guessing it fits
,value text NOT NULL
,CONSTRAINT measure_val_pk PRIMARY KEY (measure_id, param_id)
);
CREATE INDEX measure_val_param_id_idx ON measure_val (param_id); -- !crucial!
I renamed the bulky measurement_data_value to measure_val, because that's what's in the table: parameter values for measurements. Now the multi-column pk makes sense, too.
But I added a separate index on param_id. The way you had it, column param_id was the second column in a multi-column index, which leads to poor results for lookups on param_id alone. Read all the gory details about that under this related question on dba.SE.
After implementing this alone, your query should be faster. But there is more you can do.
The following function fills in the data much faster. The point is that I use set-based DML commands, executing mass-inserts instead of loops that execute individual inserts, which take forever. That makes quite a difference for the considerable amount of test data you want to insert. It's also much shorter and simpler.
To make it even more efficient, I use a data-modifying CTE (new in Postgres 9.1) that instantly reuses the massive amount of rows in the last step.
CREATE OR REPLACE FUNCTION insert_data()
RETURNS void LANGUAGE plpgsql AS
$BODY$
BEGIN
INSERT INTO device (device)
SELECT 'dev_' || to_char(g, 'FM00')
FROM generate_series(1,5) g;
INSERT INTO param (param)
SELECT 'param_' || to_char(g, 'FM00')
FROM generate_series(1,20) g;
INSERT INTO spot (spot_id)
SELECT nextval('spot_spot_id_seq'::regclass)
FROM generate_series(1,10) g; -- to set sequence, too
WITH x AS (
INSERT INTO measure (device_id, spot_id, t_stamp)
SELECT d.device_id, s.spot_id, g
FROM device d
CROSS JOIN spot s
CROSS JOIN generate_series('2012-01-06 23:00:00' -- smaller set
,'2012-01-07 00:00:00' -- for quick tests
,interval '1 min') g
RETURNING *
)
INSERT INTO measure_val (measure_id, param_id, value)
SELECT x.measure_id
,p.param_id
,x.device_id || '_' || x.spot_id || '_' || p.param
FROM x
CROSS JOIN param p;
END
$BODY$;
Call:
SELECT insert_data();
Use explicit JOIN syntax and table aliases to make your queries easier to read and debug:
SELECT v.value
FROM   param p
JOIN   measure_val v USING (param_id)
WHERE  p.param = 'param_01';
The USING clause is just for simplifying the syntax; it is not superior to ON otherwise.
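The same query spelled out with ON, just to show the two forms are interchangeable here:
SELECT v.value
FROM   param p
JOIN   measure_val v ON v.param_id = p.param_id
WHERE  p.param = 'param_01';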
This should be much faster now for two reasons:
- the index param_param_idx on param.param
- the index measure_val_param_id_idx on measure_val.param_id, as explained in detail here
My major oversight was that you had already added the crucial index in the form of measurement_data_value_idx_fk_parameter_id further down in your question. (I blame your cryptic names! :p) On closer inspection, you have 8.64M (6 * 24 * 60 * 5 * 10 * 20) rows in your test setup and your query retrieves 432K of them. I only tested with a much smaller subset.
Also, as you retrieve 5% of the whole table, indexes will only go so far. I was too optimistic; that amount of data is bound to take some time. Is it a realistic requirement that you query over 400k rows at once? I would assume you aggregate in your real-life application?
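If aggregation is indeed what the application needs, pushing it into the query keeps the result set small. A sketch against the revised schema; the daily grouping, the count and param_id = 1 are placeholders, not anything from the project:
SELECT m.t_stamp::date AS day
      ,count(*)        AS n_values
FROM   measure m
JOIN   measure_val v USING (measure_id)
WHERE  v.param_id = 1   -- hypothetical parameter of interest
GROUP  BY 1
ORDER  BY 1;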
Further things that could help:
- More RAM, and settings that make use of it. A virtual Debian 6.0 machine with 1GB of RAM is way below what you need.
- Partial indexes, especially in connection with the index-only scans of PostgreSQL 9.2 (see the sketch below).
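A partial index would look something like this (purely illustrative; the predicate depends on which parameters are queried most often, and param_id = 1 is just a placeholder):
-- covers only rows for one frequently queried parameter
CREATE INDEX measure_val_param01_idx ON measure_val (measure_id)
WHERE  param_id = 1;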
If your queries are predominantly by parameter (like the example), you could use CLUSTER to physically rewrite the table according to an index:
CLUSTER measure_val USING measure_val_param_id_idx;
This way all rows for one parameter are stored in succession. That means fewer blocks to read and easier caching, which should make the query at hand much faster. Or INSERT the rows in a favorable order to begin with, to the same effect.
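One way to insert in favorable order (a sketch only, reusing the names of the revised schema): sort the bulk insert by param_id so that rows for one parameter land next to each other from the start:
INSERT INTO measure_val (measure_id, param_id, value)
SELECT m.measure_id
      ,p.param_id
      ,m.device_id || '_' || m.spot_id || '_' || p.param
FROM   measure m
CROSS  JOIN param p
ORDER  BY p.param_id, m.measure_id;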
Partitioning would mix well with CLUSTER, since you would not have to rewrite the whole (huge) table every time. As your data is obviously just inserted and not updated, a partition would stay "in order" after CLUSTER.
Generally, PostgreSQL 9.2 should be great for you as its improvements focus on performance with big data.