I have an issue with a Postgres query I'm trying to run - I've tried lots of ways of framing the problem, but with no joy so far.
I have managed to write some queries that work, but the crux is performance: the queries that do work are far too slow to be usable.
I have a table called events_hub that links to separate tables containing information about different events. Different events are distinguished by different event_types. These events are also grouped into aggregates, and the aggregates are distinguished by aggregate_id.
My basic problem is that I want to find the earliest time associated with event 1 for each aggregate group, and then count the number of occurrences of event 2 in a window of time leading up to that time (e.g. counting the number of times event 2 happens in the 24-hour period before the earliest occurrence of an aggregate group).
The events hub table looks something like the following:
| aggregate_id | event_id | event_type | event_time |
------------------------------------------------------
| 1            | 1        | 1          | 1st Jan    |
| 1            | 2        | 1          | 2nd Jan    |
| 2            | 3        | 1          | 2nd Jan    |
| 2            | 4        | 1          | 3rd Jan    |
| null         | 5        | 2          | 30th Dec   |
| null         | 6        | 2          | 31st Dec   |
| null         | 7        | 2          | 1st Jan    |
| null         | 8        | 2          | 1st Jan    |
------------------------------------------------------
In the toy example above, I'd want to return:
| aggregate_id | count_of_event2 |
----------------------------------
| 1            | 3               |
| 2            | 2               |
----------------------------------
This is because the earliest occurrence of aggregate_id 1 has 3 occurrences of event_type 2 in the preceding day, while the earliest occurrence of aggregate_id 2 has only 2.
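For concreteness, the examples in this question assume a minimal schema along these lines (the real tables have more columns and different types):

-- Minimal assumed schema for the examples below
-- (actual types and constraints may differ):
CREATE TABLE events_hub (
    aggregate_id integer,            -- null for events not in an aggregate
    event_id     integer PRIMARY KEY,
    event_type   integer NOT NULL,
    event_time   timestamp NOT NULL
);

CREATE TABLE event_2_attributes (
    event_id  integer PRIMARY KEY,   -- matches events_hub.event_id
    some_flag boolean NOT NULL
);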
Approach 1
My first attempt involves joins wrapped in an outer GROUP BY. The following query runs extremely quickly, but does not return exactly what I want:
SELECT
    aggregate_id,
    count(aggregate_id)
FROM
    (SELECT
         aggregate_id,
         min(event_time) AS time_of_event1
     FROM events_hub WHERE event_type = 1
     GROUP BY aggregate_id) AS t1
LEFT JOIN
    (SELECT event_time AS time_of_event2
     FROM events_hub WHERE event_type = 2) AS t2
    ON t2.time_of_event2 BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                             AND t1.time_of_event1
GROUP BY aggregate_id
Running EXPLAIN ANALYZE on this returns the following (please note that the SQL queries in this question are reduced versions of the actual queries I'd like to run - there are a few extra restrictions on the tables that therefore appear in the explain plan):
HashAggregate  (cost=1262545.21..1262547.21 rows=200 width=15) (actual time=536.206..539.222 rows=2824 loops=1)
  Group Key: events_hub_1.aggregate_id
  ->  Nested Loop Left Join  (cost=9137.36..1191912.59 rows=14126523 width=15) (actual time=15.419..395.895 rows=111948 loops=1)
        ->  HashAggregate  (cost=9136.80..9141.42 rows=462 width=23) (actual time=15.387..19.316 rows=2824 loops=1)
              Group Key: events_hub_1.aggregate_id
              ->  Index Only Scan using comp_index1 on events_hub events_hub_1  (cost=0.56..9110.87 rows=5186 width=23) (actual time=2.669..9.750 rows=4412 loops=1)
                    Index Cond: ((event_type_code = 5) AND (event_datetime >= '2013-01-01 00:00:00'::timestamp without time zone) AND (event_datetime <= '2013-01-02 00:00:00'::timestamp without time zone) AND (aggregate_id IS NOT NULL))
                    Heap Fetches: 4412
        ->  Index Only Scan using comp_index on events_hub  (cost=0.56..2254.33 rows=30577 width=8) (actual time=0.005..0.049 rows=40 loops=2824)
              Index Cond: ((event_type_code = 3) AND (event_datetime <= (min(events_hub_1.event_datetime))) AND (event_datetime >= ((min(events_hub_1.event_datetime)) - '12:00:00'::interval)))
              Heap Fetches: 0
Planning time: 0.326 ms
Execution time: 542.020 ms
This is not particularly surprising, as I have a composite index on (event_type, event_time) on the events hub, so the relatively complicated join condition based on the relative times of the two events runs quickly.
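That index looks something like this (the name is illustrative; in the plans above it shows up as comp_index / comp_index1, which also cover some of the extra restricted columns):

-- Illustrative definition of the composite index described above:
CREATE INDEX comp_index ON events_hub (event_type, event_time);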
However, when I try to add another condition to the query based on some of the attributes of event 2 (to get to the result I need), the query drastically slows down (the above query is done in a flash, whereas the one below runs for minutes):
SELECT
    aggregate_id,
    count(aggregate_id)
FROM
    (SELECT
         aggregate_id,
         min(event_time) AS time_of_event1
     FROM events_hub WHERE event_type = 1
     GROUP BY aggregate_id) AS t1
LEFT JOIN
    (SELECT event_id, event_time AS time_of_event2
     FROM events_hub WHERE event_type = 2) AS t2
    ON t2.time_of_event2 BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                             AND t1.time_of_event1
INNER JOIN
    (SELECT event_id FROM event_2_attributes WHERE some_flag = TRUE) AS t3
    ON t2.event_id = t3.event_id
GROUP BY aggregate_id
For this query, EXPLAIN ANALYZE returns:
HashAggregate  (cost=33781.17..33783.17 rows=200 width=15) (actual time=479888.736..479891.819 rows=2824 loops=1)
  Group Key: events_hub_1.aggregate_id
  ->  Nested Loop  (cost=9625.94..33502.10 rows=55815 width=15) (actual time=346721.414..479857.494 rows=26164 loops=1)
        Join Filter: ((events_hub.event_datetime <= (min(events_hub_1.event_datetime))) AND (events_hub.event_datetime >= ((min(events_hub_1.event_datetime)) - '12:00:00'::interval)))
        Rows Removed by Join Filter: 209062796
        ->  Merge Join  (cost=489.14..14311.03 rows=1087 width=8) (actual time=1.360..1571.387 rows=74040 loops=1)
              Merge Cond: (events_hub.event_id = arrests.event_id)
              ->  Index Scan using comp_index4 on events_hub  (cost=0.44..290158.71 rows=275192 width=12) (actual time=1.344..512.787 rows=282766 loops=1)
                    Index Cond: (event_type_code = 3)
              ->  Index Scan using arrests_events_id_index on arrests  (cost=0.42..11186.59 rows=73799 width=4) (actual time=0.008..456.550 rows=74040 loops=1)
                    Filter: felony_flag
                    Rows Removed by Filter: 210238
        ->  Materialize  (cost=9136.80..9148.35 rows=462 width=23) (actual time=0.001..3.002 rows=2824 loops=74040)
              ->  HashAggregate  (cost=9136.80..9141.42 rows=462 width=23) (actual time=10.963..14.006 rows=2824 loops=1)
                    Group Key: events_hub_1.aggregate_id
                    ->  Index Only Scan using comp_index1 on events_hub events_hub_1  (cost=0.56..9110.87 rows=5186 width=23) (actual time=0.018..5.405 rows=4412 loops=1)
                          Index Cond: ((event_type_code = 5) AND (event_datetime >= '2013-01-01 00:00:00'::timestamp without time zone) AND (event_datetime <= '2013-01-02 00:00:00'::timestamp without time zone) AND (aggregate_id IS NOT NULL))
                          Heap Fetches: 4412
Planning time: 12.548 ms
Execution time: 479894.888 ms
Please note that when the inner join is included, fewer rows are actually returned, and yet the query runs so much slower.
I've fiddled with nesting these joins inside each other, and with switching things around so that there is a RIGHT JOIN rather than a LEFT JOIN, but it doesn't make any difference.
I've also tried CTEs for each subquery to try to force the execution order, but no luck there either.
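For reference, the CTE version looked roughly like the sketch below (table and column names as above). One thing worth noting: on PostgreSQL 12 and later a plain CTE is no longer an optimisation fence, so AS MATERIALIZED would be needed to actually force materialisation; on earlier versions CTEs are always materialised.

-- Sketch of the CTE variant; AS MATERIALIZED is PostgreSQL 12+ syntax:
WITH t1 AS MATERIALIZED
    (SELECT aggregate_id, min(event_time) AS time_of_event1
     FROM events_hub WHERE event_type = 1
     GROUP BY aggregate_id),
t2 AS MATERIALIZED
    (SELECT event_id, event_time AS time_of_event2
     FROM events_hub WHERE event_type = 2)
SELECT
    t1.aggregate_id,
    count(t1.aggregate_id)
FROM t1
LEFT JOIN t2
    ON t2.time_of_event2 BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                             AND t1.time_of_event1
INNER JOIN
    (SELECT event_id FROM event_2_attributes WHERE some_flag = TRUE) AS t3
    ON t2.event_id = t3.event_id
GROUP BY t1.aggregate_id;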
Approach 2
As a second approach, I try using a correlated subquery in the SELECT list that returns the count of event 2:
SELECT
    t1.aggregate_id,
    (SELECT count(t3.event_id)
     FROM (SELECT event_id FROM events_hub AS t2
           WHERE t2.event_type = 2
             AND t2.event_time BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                                   AND t1.time_of_event1) AS t3
     INNER JOIN event_2_attributes AS t4
         ON t3.event_id = t4.event_id
     WHERE t4.some_flag = TRUE) AS count_column
FROM
    (SELECT
         aggregate_id,
         min(event_time) AS time_of_event1
     FROM events_hub WHERE event_type = 1
     GROUP BY aggregate_id) AS t1
This works pretty well and runs in around 15 seconds. However, when I try to take the results and insert them into another table (which is required for what I do next), the query takes a huge amount of time to run:
CREATE TABLE tbl AS
< query above >
This is baffling to me! I've tried to run EXPLAIN ANALYZE on this query, but got to 2000 seconds before quitting. As above, though, without EXPLAIN ANALYZE this runs in 15 seconds.
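(Plain EXPLAIN, without ANALYZE, does at least plan a statement without executing it, and it accepts CREATE TABLE AS, so the chosen plan can be inspected even when the statement itself is too slow to finish. A sketch using a compacted form of the approach 2 query:)

-- Plan only, no execution; EXPLAIN works for CREATE TABLE AS in PostgreSQL:
EXPLAIN
CREATE TABLE tbl AS
SELECT
    t1.aggregate_id,
    (SELECT count(*)
     FROM events_hub AS t2
     INNER JOIN event_2_attributes AS t4 ON t2.event_id = t4.event_id
     WHERE t2.event_type = 2
       AND t4.some_flag = TRUE
       AND t2.event_time BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                             AND t1.time_of_event1) AS count_column
FROM
    (SELECT aggregate_id, min(event_time) AS time_of_event1
     FROM events_hub WHERE event_type = 1
     GROUP BY aggregate_id) AS t1;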
Approach 3
As a final approach, I've tried using a lateral join as follows (without the group by here):
WITH t1 AS
    (SELECT
         aggregate_id,
         min(event_time) AS time_of_event1
     FROM events_hub WHERE event_type = 1
     GROUP BY aggregate_id)
SELECT
    t1.aggregate_id,
    t2.event_time
FROM t1
LEFT JOIN LATERAL
    (SELECT event_time FROM
         (SELECT event_id, event_time FROM events_hub WHERE event_type = 2) AS t3
     INNER JOIN
         (SELECT event_id FROM event_2_attributes WHERE some_flag = TRUE) AS t4
         ON t3.event_id = t4.event_id
     WHERE t3.event_time BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                             AND t1.time_of_event1
    ) AS t2
    ON TRUE
This query runs, but again, very, very slowly - even without the group by operation.
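(If the lateral route were viable, the aggregated form would presumably fold the count into the lateral subquery itself, something like the sketch below, so no outer GROUP BY is needed; table and column names as above.)

-- Sketch: compute the count inside the lateral subquery:
WITH t1 AS
    (SELECT aggregate_id, min(event_time) AS time_of_event1
     FROM events_hub WHERE event_type = 1
     GROUP BY aggregate_id)
SELECT
    t1.aggregate_id,
    t2.count_of_event2
FROM t1
LEFT JOIN LATERAL
    (SELECT count(*) AS count_of_event2
     FROM events_hub AS t3
     INNER JOIN event_2_attributes AS t4 ON t3.event_id = t4.event_id
     WHERE t3.event_type = 2
       AND t4.some_flag = TRUE
       AND t3.event_time BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                             AND t1.time_of_event1) AS t2
    ON TRUE;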
Any light you could shed on these (possibly unrelated?) issues would be hugely appreciated. It's probably worth mentioning that each individual column in the events hub is indexed.
Many thanks!
OK, I've figured this out.
Although not the 'neatest' of solutions, the final trick was to create a table containing the results of the initial GROUP BY operation that returns the earliest time associated with an aggregate_id:
CREATE TABLE earliest_time AS
    (SELECT
         aggregate_id,
         min(event_time) AS time_of_event1
     FROM events_hub WHERE event_type = 1
     GROUP BY aggregate_id)
And then to add indexes on both the aggregate_id and time_of_event1 columns.
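Concretely, something like the following (the index names here are my own, and the ANALYZE gives the planner fresh statistics on the new table):

-- Index names assumed for illustration:
CREATE INDEX earliest_time_aggregate_id_idx ON earliest_time (aggregate_id);
CREATE INDEX earliest_time_time_of_event1_idx ON earliest_time (time_of_event1);
-- Refresh planner statistics for the newly created table:
ANALYZE earliest_time;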
This table was then used as per approach 1 above.
Having the subquery already materialised helps the planner pick the most efficient path, and the execution time drops by 2 orders of magnitude.
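For completeness, the final query is then roughly the approach 1 query with the grouped subquery swapped out for the new table. A sketch, counting t2.time_of_event2 rather than aggregate_id so that aggregates with no matching event 2 come out as 0 rather than 1:

-- Sketch of the final query: approach 1 reading from the
-- pre-materialised earliest_time table (names as above):
SELECT
    t1.aggregate_id,
    count(t2.time_of_event2) AS count_of_event2
FROM earliest_time AS t1
LEFT JOIN
    (SELECT eh.event_time AS time_of_event2
     FROM events_hub AS eh
     INNER JOIN event_2_attributes AS ea ON eh.event_id = ea.event_id
     WHERE eh.event_type = 2 AND ea.some_flag = TRUE) AS t2
    ON t2.time_of_event2 BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                             AND t1.time_of_event1
GROUP BY t1.aggregate_id;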