
Postgres subqueries running extremely slowly when joined

I have an issue with a Postgres query I'm trying to run. I've tried lots of ways of framing the problem, but with no joy so far.

I have managed to write some queries that work, but the crux is performance: the queries that do work are far too slow to be usable.

I have a table called events_hub that links to separate tables containing information regarding different events. Different events are distinguished by different event_types. These events are also grouped into aggregates, and the aggregates are distinguished by aggregate_id.

My basic problem is that I want to find the earliest time associated with event 1 for each aggregate group, and then count the number of occurrences of event 2 in a window of time leading up to that time (e.g. counting the number of times event 2 happens in a 24 hour period before the earliest occurrence of an aggregate group).
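
For reference, here is a minimal sketch of the schema I have in mind - the column types are my own assumptions for illustration, not the actual DDL:

-- Hypothetical DDL, for illustration only; types are assumed.
CREATE TABLE events_hub (
    aggregate_id integer,            -- NULL for events not yet grouped
    event_id     integer PRIMARY KEY,
    event_type   integer NOT NULL,
    event_time   timestamp NOT NULL
);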

The events hub table looks something like the following:

| aggregate_id | event_id |  event_type  | event_time |
-------------------------------------------------------
|      1       |     1    |       1      |  1st Jan   |
|      1       |     2    |       1      |  2nd Jan   |
|      2       |     3    |       1      |  2nd Jan   |
|      2       |     4    |       1      |  3rd Jan   |
|      null    |     5    |       2      |  30th Dec  |
|      null    |     6    |       2      |  31st Dec  |
|      null    |     7    |       2      |  1st Jan   |
|      null    |     8    |       2      |  1st Jan   |
-------------------------------------------------------

In the toy example above, I'd want to return:

| aggregate_id | count_of_event2 |
----------------------------------
|      1       |        3        |
|      2       |        2        |
----------------------------------

This is because the earliest occurrence of aggregate_id 1 (1st Jan) has 3 occurrences of event_type 2 in the preceding 24 hours (31st Dec and the two on 1st Jan), while the earliest occurrence of aggregate_id 2 (2nd Jan) has only 2 (the two on 1st Jan).

Approach 1

My first attempt uses joins wrapped in a GROUP BY. The following query runs extremely quickly, but does not return exactly what I want:

SELECT
    aggregate_id,
    count(aggregate_id)
FROM
    (SELECT
        aggregate_id,
        min(event_time) as time_of_event1
     FROM events_hub WHERE event_type = 1
     GROUP BY aggregate_id) as t1
     LEFT JOIN
    (SELECT event_time as time_of_event2
     FROM events_hub WHERE event_type = 2) as t2
     ON t2.time_of_event2 BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                          AND t1.time_of_event1
GROUP BY aggregate_id

Running EXPLAIN ANALYZE on this returns the following (please note that the SQL queries in this question are reduced versions of the ones I actually run, so a few extra restrictions - and slightly different column and table names - appear in the plans):

HashAggregate  (cost=1262545.21..1262547.21 rows=200 width=15) (actual time=536.206..539.222 rows=2824 loops=1)
  Group Key: events_hub_1.aggregate_id
  ->  Nested Loop Left Join  (cost=9137.36..1191912.59 rows=14126523 width=15) (actual time=15.419..395.895 rows=111948 loops=1)
        ->  HashAggregate  (cost=9136.80..9141.42 rows=462 width=23) (actual time=15.387..19.316 rows=2824 loops=1)
              Group Key: events_hub_1.aggregate_id
              ->  Index Only Scan using comp_index1 on events_hub events_hub_1  (cost=0.56..9110.87 rows=5186 width=23) (actual time=2.669..9.750 rows=4412 loops=1)
                    Index Cond: ((event_type_code = 5) AND (event_datetime >= '2013-01-01 00:00:00'::timestamp without time zone) AND (event_datetime <= '2013-01-02 00:00:00'::timestamp without time zone) AND (aggregate_id IS NOT NULL))
                    Heap Fetches: 4412
        ->  Index Only Scan using comp_index on events_hub  (cost=0.56..2254.33 rows=30577 width=8) (actual time=0.005..0.049 rows=40 loops=2824)
              Index Cond: ((event_type_code = 3) AND (event_datetime <= (min(events_hub_1.event_datetime))) AND (event_datetime >= ((min(events_hub_1.event_datetime)) - '12:00:00'::interval)))
              Heap Fetches: 0
Planning time: 0.326 ms
Execution time: 542.020 ms

This is not particularly surprising, as I have a composite index on (event_type, event_time) on the events hub, so the relatively complicated join condition based on the relative times of the two events runs quickly.
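
For context, the composite index is defined along these lines (a sketch - the real tables use the column names you can see in the plans):

-- Assumed definition of the composite index; in the real tables the
-- columns are named event_type_code / event_datetime, as in the plans.
CREATE INDEX comp_index ON events_hub (event_type, event_time);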

However, when I try to add another condition to the query based on some of the attributes of event 2 (to get to the result I actually need), it slows down drastically: the query above finishes in a flash, whereas the one below runs for minutes:

SELECT
    aggregate_id,
    count(aggregate_id)
FROM
    (SELECT
        aggregate_id,
        min(event_time) as time_of_event1
     FROM events_hub WHERE event_type = 1
     GROUP BY aggregate_id) as t1
     LEFT JOIN
    (SELECT event_id, event_time as time_of_event2
     FROM events_hub WHERE event_type = 2) as t2
     ON t2.time_of_event2 BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                          AND t1.time_of_event1
     INNER JOIN
     (SELECT event_id FROM event_2_attributes WHERE some_flag = TRUE) as t3
     ON t2.event_id = t3.event_id
GROUP BY aggregate_id

For this query, EXPLAIN ANALYZE returns:

HashAggregate  (cost=33781.17..33783.17 rows=200 width=15) (actual time=479888.736..479891.819 rows=2824 loops=1)
  Group Key: events_hub_1.aggregate_id
  ->  Nested Loop  (cost=9625.94..33502.10 rows=55815 width=15) (actual time=346721.414..479857.494 rows=26164 loops=1)
        Join Filter: ((events_hub.event_datetime <= (min(events_hub_1.event_datetime))) AND (events_hub.event_datetime >= ((min(events_hub_1.event_datetime)) - '12:00:00'::interval)))
        Rows Removed by Join Filter: 209062796
        ->  Merge Join  (cost=489.14..14311.03 rows=1087 width=8) (actual time=1.360..1571.387 rows=74040 loops=1)
              Merge Cond: (events_hub.event_id = arrests.event_id)
              ->  Index Scan using comp_index4 on events_hub  (cost=0.44..290158.71 rows=275192 width=12) (actual time=1.344..512.787 rows=282766 loops=1)
                    Index Cond: (event_type_code = 3)
              ->  Index Scan using arrests_events_id_index on arrests  (cost=0.42..11186.59 rows=73799 width=4) (actual time=0.008..456.550 rows=74040 loops=1)
                    Filter: felony_flag
                    Rows Removed by Filter: 210238
        ->  Materialize  (cost=9136.80..9148.35 rows=462 width=23) (actual time=0.001..3.002 rows=2824 loops=74040)
              ->  HashAggregate  (cost=9136.80..9141.42 rows=462 width=23) (actual time=10.963..14.006 rows=2824 loops=1)
                    Group Key: events_hub_1.aggregate_id
                    ->  Index Only Scan using comp_index1 on events_hub events_hub_1  (cost=0.56..9110.87 rows=5186 width=23) (actual time=0.018..5.405 rows=4412 loops=1)
                          Index Cond: ((event_type_code = 5) AND (event_datetime >= '2013-01-01 00:00:00'::timestamp without time zone) AND (event_datetime <= '2013-01-02 00:00:00'::timestamp without time zone) AND (aggregate_id IS NOT NULL))
                          Heap Fetches: 4412
Planning time: 12.548 ms
Execution time: 479894.888 ms

Please note that when the inner join is included, fewer rows are actually returned - and yet the query runs so much slower.

I've fiddled with nesting these joins inside each other, and switching things around so that there is a RIGHT JOIN rather than a LEFT JOIN, but it doesn't make any difference.

I've also tried CTEs for each subquery to try to force the execution order, but no luck there either.
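
For example, one CTE rewrite looked roughly like this (a sketch; on the Postgres version I'm on, each CTE is an optimisation fence and is materialised before the outer join, which is why I hoped it would pin down the execution order):

-- Sketch of the CTE rewrite; each CTE is evaluated independently here.
WITH t1 AS (
    SELECT aggregate_id, min(event_time) AS time_of_event1
    FROM events_hub
    WHERE event_type = 1
    GROUP BY aggregate_id
),
t2 AS (
    SELECT eh.event_id, eh.event_time AS time_of_event2
    FROM events_hub AS eh
    INNER JOIN event_2_attributes AS ea ON ea.event_id = eh.event_id
    WHERE eh.event_type = 2 AND ea.some_flag = TRUE
)
SELECT t1.aggregate_id, count(t2.event_id)
FROM t1
LEFT JOIN t2
    ON t2.time_of_event2 BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                             AND t1.time_of_event1
GROUP BY t1.aggregate_id;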

Approach 2

As a second approach, I tried using a subquery that returns the count of event 2:

SELECT
    t1.aggregate_id,
    (SELECT count(t3.event_id)
    FROM (SELECT event_id FROM events_hub AS t2 WHERE t2.event_type = 2
          AND t2.event_time BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                            AND t1.time_of_event1) as t3
          INNER JOIN event_2_attributes as t4
          ON t3.event_id = t4.event_id
          WHERE t4.some_flag = TRUE) as count_column
FROM
    (SELECT
        aggregate_id,
        min(event_time) as time_of_event1
     FROM events_hub WHERE event_type = 1
     GROUP BY aggregate_id) as t1   

This works pretty well, and runs in around 15 seconds. However, when I try and take the results and insert them into another table (which is required for what I do next), the query takes a huge amount of time to run:

CREATE TABLE tbl AS
    < query above >

This is baffling to me!

I've tried to run EXPLAIN ANALYZE on this query, but got to 2000 seconds before quitting. As above, though, without EXPLAIN ANALYZE it runs in 15 seconds.

Approach 3

As a final approach, I've tried using a lateral join, as follows (without the GROUP BY here):

WITH t1 AS
(SELECT
    aggregate_id,
    min(event_time) as time_of_event1
FROM events_hub WHERE event_type = 1
GROUP BY aggregate_id)
SELECT
    t1.aggregate_id,
    t2.event_time
FROM t1
LEFT JOIN LATERAL
    (SELECT event_time FROM
        (SELECT event_id, event_time FROM events_hub WHERE event_type = 2) as t3
        INNER JOIN
        (SELECT event_id FROM event_2_attributes WHERE some_flag = TRUE) as t4
        ON t3.event_id = t4.event_id
    WHERE t3.event_time BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                        AND t1.time_of_event1
    ) as t2
ON TRUE

This query runs, but again, very, very slowly - even without the GROUP BY operation.
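
For completeness, the counting version I'd ultimately want from this approach pushes the aggregate inside the lateral subquery, something like this sketch:

-- Sketch: count inside the lateral subquery instead of grouping outside;
-- the subquery always returns one row, so unmatched aggregates get 0.
WITH t1 AS (
    SELECT aggregate_id, min(event_time) AS time_of_event1
    FROM events_hub
    WHERE event_type = 1
    GROUP BY aggregate_id
)
SELECT t1.aggregate_id, t2.count_of_event2
FROM t1
LEFT JOIN LATERAL (
    SELECT count(*) AS count_of_event2
    FROM events_hub AS eh
    INNER JOIN event_2_attributes AS ea ON ea.event_id = eh.event_id
    WHERE eh.event_type = 2
      AND ea.some_flag = TRUE
      AND eh.event_time BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                            AND t1.time_of_event1
) AS t2 ON TRUE;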


Any light you could shed on these (possibly unrelated?) issues would be hugely appreciated. It's probably worth mentioning that each individual column in the events hub is indexed.

Many thanks!

asked Aug 05 '16 by Ned Yoxall



1 Answer

OK, I've figured this out.

Although not the 'neatest' of solutions, the final trick was to create a table containing the results of the initial GROUP BY operation, which returns the earliest time associated with each aggregate_id:

CREATE TABLE earliest_time AS
(SELECT
    aggregate_id,
    min(event_time) as time_of_event1
 FROM events_hub WHERE event_type = 1
 GROUP BY aggregate_id)

And then to add indexes on both the aggregate_id and time_of_event1 columns.
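
Concretely, something like this (a sketch; the ANALYZE just gives the planner fresh statistics for the new table):

-- Index the materialised results, then refresh planner statistics.
CREATE INDEX ON earliest_time (aggregate_id);
CREATE INDEX ON earliest_time (time_of_event1);
ANALYZE earliest_time;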

This table was then used as per approach 1 above.
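
So the final query is essentially approach 1 with the pre-computed table swapped in for the grouping subquery (a sketch):

-- Approach 1, reading from the pre-computed earliest_time table.
SELECT
    t1.aggregate_id,
    count(t2.event_id)
FROM earliest_time AS t1
LEFT JOIN
    (SELECT eh.event_id, eh.event_time AS time_of_event2
     FROM events_hub AS eh
     INNER JOIN event_2_attributes AS ea ON ea.event_id = eh.event_id
     WHERE eh.event_type = 2 AND ea.some_flag = TRUE) AS t2
    ON t2.time_of_event2 BETWEEN t1.time_of_event1 - INTERVAL '24 hours'
                             AND t1.time_of_event1
GROUP BY t1.aggregate_id;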

Having the subquery already materialised helps the planner pick an efficient path, and the execution time drops by two orders of magnitude.

answered Sep 29 '22 by Ned Yoxall