I am designing a system that will store records containing a begin and an end time. For example:
CREATE TABLE test (
id bigserial PRIMARY KEY,
ts_start timestamp NOT NULL,
ts_end timestamp NOT NULL,
foo bar NOT NULL,
...
);
Now I want to run queries on this to find all rows that overlap a certain timestamp. This results in a WHERE clause like:
WHERE ts_start <= '2006-04-06 12:34:56' AND ts_end > '2006-04-06 12:34:56'
I tested this with a massive amount of generated test data and the performance is quite bad. I tried an index on ts_start plus a separate index on ts_end, and also a multi-column index on (ts_start, ts_end). The last gave the best result, but it is still far from optimal.
The problem is that PostgreSQL doesn't know that ts_end is guaranteed to be larger than ts_start, so it chooses a plan that could also find rows where ts_end is smaller than ts_start.
Any suggestions how to solve this problem?
Edit: For anyone who has this problem too: if you can wait a little longer, PostgreSQL 9.2 has the perfect solution: range types. 9.2 is in beta now; the final release will most likely be at the end of 2012.
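To illustrate what that solution looks like, here is a minimal sketch using a tsrange column with a GiST index (PostgreSQL 9.2+; table and column names are illustrative, not from the original schema):

```sql
-- Sketch assuming PostgreSQL 9.2+: store the interval as a single range column.
CREATE TABLE test (
    id     bigserial PRIMARY KEY,
    period tsrange NOT NULL   -- replaces the ts_start / ts_end pair
);

-- GiST index that supports range containment and overlap operators:
CREATE INDEX test_period_idx ON test USING gist (period);

-- Find all rows whose range contains a given timestamp:
SELECT * FROM test
WHERE period @> '2006-04-06 12:34:56'::timestamp;
```

The `@>` ("contains element") operator is index-supported by the GiST index, so the planner no longer needs two half-open conditions on separate columns.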
There was "temporal postgres" (google it), but I don't know if it is still maintained. I believe there was a discussion about including this kind of search in Postgres, but I don't remember how it ended. Anyway:
Example using box and GiST:
CREATE TABLE segments( start INTEGER NOT NULL, stop INTEGER NOT NULL, range_box BOX NOT NULL );
INSERT INTO segments SELECT n,n+1,BOX(POINT(n,-1),POINT(n+1,1)) FROM generate_series( 1, 1000000 ) n;
CREATE INDEX segments_box ON segments USING gist( range_box );
CREATE INDEX segments_start ON segments(start);
CREATE INDEX segments_stop ON segments(stop);
EXPLAIN ANALYZE SELECT * FROM segments WHERE 300000 BETWEEN start AND stop;
Index Scan using segments_start on segments (cost=0.00..12959.24 rows=209597 width=72) (actual time=91.990..91.990 rows=2 loops=1)
Index Cond: (300000 >= start)
Filter: (300000 <= stop)
Total runtime: 92.023 ms
EXPLAIN ANALYZE SELECT * FROM segments WHERE range_box && '(300000,0,300000,0)'::BOX;
Bitmap Heap Scan on segments (cost=283.49..9740.27 rows=5000 width=72) (actual time=0.036..0.037 rows=2 loops=1)
Recheck Cond: (range_box && '(300000,0),(300000,0)'::box)
-> Bitmap Index Scan on segments_box (cost=0.00..282.24 rows=5000 width=0) (actual time=0.032..0.032 rows=2 loops=1)
Index Cond: (range_box && '(300000,0),(300000,0)'::box)
Total runtime: 0.064 ms
As you can see, the GiST index is ridiculously fast here (about 1500 times faster!), and you can use many operators such as overlaps, contains, is contained by, etc.
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
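The example above uses integers; for the question's timestamp columns, one way to apply the same trick is to map timestamps to epoch seconds when building the box. A hedged sketch (table and column names are illustrative, not from the original schema):

```sql
-- Sketch: the integer-box trick applied to timestamps by converting
-- them to epoch seconds with EXTRACT(EPOCH FROM ...).
CREATE TABLE periods (
    ts_start  timestamp NOT NULL,
    ts_end    timestamp NOT NULL,
    range_box BOX NOT NULL
);

INSERT INTO periods VALUES (
    '2006-04-06 12:00:00', '2006-04-06 13:00:00',
    BOX(POINT(EXTRACT(EPOCH FROM timestamp '2006-04-06 12:00:00'), -1),
        POINT(EXTRACT(EPOCH FROM timestamp '2006-04-06 13:00:00'),  1))
);

CREATE INDEX periods_box ON periods USING gist (range_box);

-- Overlap test for a single timestamp: a degenerate (zero-width) box.
SELECT * FROM periods
WHERE range_box &&
      BOX(POINT(EXTRACT(EPOCH FROM timestamp '2006-04-06 12:34:56'), 0),
          POINT(EXTRACT(EPOCH FROM timestamp '2006-04-06 12:34:56'), 0));
```

Keeping ts_start/ts_end alongside the box lets you still read the original values; the box exists purely to feed the GiST index.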
You're encountering the same problem as someone trying to index line segments and then query whether a point lies in a segment. You just can't do that by indexing each dimension separately; you need an index that builds some sort of BSP structure.
I'm not sure whether PG has a built-in data type for date ranges, but I'm certain that if you use PostGIS to represent the time ranges as points in 2D space and then tell PG to geo-index that, you'll get the best possible performance out of this query.
Maybe there's a date-specific equivalent of my suggestion built into PG, but, as I said, I'm not familiar with one. I am familiar with PG's geometric indexing capabilities, though, and I think you should seriously consider them as an optimization.
Here's a naive illustration (although I'm sure it will be very fast to query): each row becomes the point (from, to), and the rows overlapping a timestamp p are exactly the points with from <= p and to >= p, i.e. the points in the quadrant up-and-left of (p, p):
|
|
|
|
to |
(timestamp) |
|
|
|_________________ (from,to)
|__ |
| |(p,p) |
|__|______________|_______________________
from (timestamp)
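That picture can be turned into a query sketch using PG's built-in point type rather than full PostGIS (a simplified assumption on my part; column names and the 1e12 "infinity" sentinel are illustrative):

```sql
-- Sketch: each interval stored as the 2D point (from, to), GiST-indexed.
CREATE TABLE intervals (
    t_from double precision NOT NULL,
    t_to   double precision NOT NULL,
    pt     point NOT NULL            -- pt = POINT(t_from, t_to)
);
CREATE INDEX intervals_pt ON intervals USING gist (pt);

-- Rows overlapping p = 300000 are the points with t_from <= p AND t_to >= p,
-- i.e. points contained in the box (0, p) .. (p, large sentinel):
SELECT * FROM intervals
WHERE pt <@ BOX(POINT(0, 300000), POINT(300000, 1e12));
```

The `<@` ("contained in") operator on points is supported by the GiST index, so this is the same single-index lookup as the box approach above, just with the interval encoded as a point instead of a box.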