Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tricky postgresql query optimization (distinct row aggregation with ordering)

I have a table of events that has a very similar schema and data distribution as this artificial table that can easily be generated locally:

CREATE TABLE events AS
WITH args AS (
    SELECT
        300 AS scale_factor, -- feel free to reduce this to speed up local testing
        1000 AS pa_count,
        1 AS l_count_min,
        29 AS l_count_rand,
        10 AS c_count,
        10 AS pr_count,
        3 AS r_count,
        '10 days'::interval AS time_range -- edit 2017-05-02: the real data set has years worth of data here, but the query time ranges stay small (a couple days)
)

SELECT
    p.c_id,
    'ABC'||lpad(p.pa_id::text, 13, '0') AS pa_id,
    'abcdefgh-'||((random()*(SELECT pr_count-1 FROM args)+1))::int AS pr_id,
    ((random()*(SELECT r_count-1 FROM args)+1))::int AS r,
    '2017-01-01Z00:00:00'::timestamp without time zone + random()*(SELECT time_range FROM args) AS t
FROM (
    SELECT
        pa_id,
        ((random()*(SELECT c_count-1 FROM args)+1))::int AS c_id,
        (random()*(SELECT l_count_rand FROM args)+(SELECT l_count_min FROM args))::int AS l_count
    FROM generate_series(1, (SELECT pa_count*scale_factor FROM args)) pa_id
) p
JOIN LATERAL (
    SELECT generate_series(1, p.l_count)
) l(id) ON (true);

Excerpt from SELECT * FROM events:

enter image description here

What I need is a query that selects all rows for a given c_id in a given time range of t, then filters them down to only include the most recent rows (by t) for each unique pr_id and pa_id combination, and then counts the number of pr_id and r combinations of those rows.

That's a quite a mouthful, so here are 3 SQL queries that I came up with that produce the desired results:

WITH query_a AS (
    SELECT
        pr_id,
        r,
        count(1) AS quantity
    FROM (
        SELECT DISTINCT ON (pr_id, pa_id)
          pr_id,
          pa_id,
          r
        FROM events
        WHERE
          c_id = 5 AND
          t >= '2017-01-03Z00:00:00' AND
          t < '2017-01-06Z00:00:00'
        ORDER BY pr_id, pa_id, t DESC
    ) latest
    GROUP BY
        1,
        2
    ORDER BY 3, 2, 1 DESC
),


query_b AS (
    SELECT
        pr_id,
        r,
        count(1) AS quantity
    FROM (
        SELECT
          pr_id,
          pa_id,
          first_not_null(r ORDER BY t DESC) AS r
        FROM events
        WHERE
          c_id = 5 AND
          t >= '2017-01-03Z00:00:00' AND
          t < '2017-01-06Z00:00:00'
        GROUP BY
          1,
          2
    ) latest
    GROUP BY
        1,
        2
    ORDER BY 3, 2, 1 DESC
),

query_c AS (
    SELECT
        pr_id,
        r,
        count(1) AS quantity
    FROM (
        SELECT
          pr_id,
          pa_id,
          first_not_null(r) AS r
        FROM events
        WHERE
          c_id = 5 AND
          t >= '2017-01-03Z00:00:00' AND
          t < '2017-01-06Z00:00:00'
        GROUP BY
          1,
          2
    ) latest
    GROUP BY
        1,
        2
    ORDER BY 3, 2, 1 DESC
)

And here is the custom aggregate function used by query_b and query_c, as well as what I believe to be the most optimal index, settings and conditions:

CREATE FUNCTION first_not_null_agg(before anyelement, value anyelement) RETURNS anyelement
    LANGUAGE sql IMMUTABLE STRICT
    AS $_$
  SELECT $1;
$_$;


CREATE AGGREGATE first_not_null(anyelement) (
    SFUNC = first_not_null_agg,
    STYPE = anyelement
);


CREATE INDEX events_idx ON events USING btree (c_id, t DESC, pr_id, pa_id, r);
VACUUM ANALYZE events;
SET work_mem='128MB';

My dilemma is that query_c outperforms query_a and query_b by a factor of > 6x, but is technically not guaranteed to produce the same result as the other queries (notice the missing ORDER BY in the first_not_null aggregate). However, in practice it seems to pick a query plan that I believe to be correct and most optimal.

Below are the EXPLAIN (ANALYZE, VERBOSE) outputs for all 3 queries on my local machine:

query_a:

CTE Scan on query_a  (cost=25810.77..26071.25 rows=13024 width=44) (actual time=3329.921..3329.934 rows=30 loops=1)
  Output: query_a.pr_id, query_a.r, query_a.quantity
  CTE query_a
    ->  Sort  (cost=25778.21..25810.77 rows=13024 width=23) (actual time=3329.918..3329.921 rows=30 loops=1)
          Output: events.pr_id, events.r, (count(1))
          Sort Key: (count(1)), events.r, events.pr_id DESC
          Sort Method: quicksort  Memory: 27kB
          ->  HashAggregate  (cost=24757.86..24888.10 rows=13024 width=23) (actual time=3329.849..3329.892 rows=30 loops=1)
                Output: events.pr_id, events.r, count(1)
                Group Key: events.pr_id, events.r
                ->  Unique  (cost=21350.90..22478.71 rows=130237 width=40) (actual time=3168.656..3257.299 rows=116547 loops=1)
                      Output: events.pr_id, events.pa_id, events.r, events.t
                      ->  Sort  (cost=21350.90..21726.83 rows=150375 width=40) (actual time=3168.655..3209.095 rows=153795 loops=1)
                            Output: events.pr_id, events.pa_id, events.r, events.t
                            Sort Key: events.pr_id, events.pa_id, events.t DESC
                            Sort Method: quicksort  Memory: 18160kB
                            ->  Index Only Scan using events_idx on public.events  (cost=0.56..8420.00 rows=150375 width=40) (actual time=0.038..101.584 rows=153795 loops=1)
                                  Output: events.pr_id, events.pa_id, events.r, events.t
                                  Index Cond: ((events.c_id = 5) AND (events.t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (events.t < '2017-01-06 00:00:00'::timestamp without time zone))
                                  Heap Fetches: 0
Planning time: 0.316 ms
Execution time: 3331.082 ms

query_b:

CTE Scan on query_b  (cost=67140.75..67409.53 rows=13439 width=44) (actual time=3761.077..3761.090 rows=30 loops=1)
  Output: query_b.pr_id, query_b.r, query_b.quantity
  CTE query_b
    ->  Sort  (cost=67107.15..67140.75 rows=13439 width=23) (actual time=3761.074..3761.081 rows=30 loops=1)
          Output: events.pr_id, (first_not_null(events.r ORDER BY events.t DESC)), (count(1))
          Sort Key: (count(1)), (first_not_null(events.r ORDER BY events.t DESC)), events.pr_id DESC
          Sort Method: quicksort  Memory: 27kB
          ->  HashAggregate  (cost=66051.24..66185.63 rows=13439 width=23) (actual time=3760.997..3761.049 rows=30 loops=1)
                Output: events.pr_id, (first_not_null(events.r ORDER BY events.t DESC)), count(1)
                Group Key: events.pr_id, first_not_null(events.r ORDER BY events.t DESC)
                ->  GroupAggregate  (cost=22188.98..63699.49 rows=134386 width=32) (actual time=2961.471..3671.669 rows=116547 loops=1)
                      Output: events.pr_id, events.pa_id, first_not_null(events.r ORDER BY events.t DESC)
                      Group Key: events.pr_id, events.pa_id
                      ->  Sort  (cost=22188.98..22578.94 rows=155987 width=40) (actual time=2961.436..3012.440 rows=153795 loops=1)
                            Output: events.pr_id, events.pa_id, events.r, events.t
                            Sort Key: events.pr_id, events.pa_id
                            Sort Method: quicksort  Memory: 18160kB
                            ->  Index Only Scan using events_idx on public.events  (cost=0.56..8734.27 rows=155987 width=40) (actual time=0.038..97.336 rows=153795 loops=1)
                                  Output: events.pr_id, events.pa_id, events.r, events.t
                                  Index Cond: ((events.c_id = 5) AND (events.t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (events.t < '2017-01-06 00:00:00'::timestamp without time zone))
                                  Heap Fetches: 0
Planning time: 0.385 ms
Execution time: 3761.852 ms

query_c:

CTE Scan on query_c  (cost=51400.06..51660.54 rows=13024 width=44) (actual time=524.382..524.395 rows=30 loops=1)
  Output: query_c.pr_id, query_c.r, query_c.quantity
  CTE query_c
    ->  Sort  (cost=51367.50..51400.06 rows=13024 width=23) (actual time=524.380..524.384 rows=30 loops=1)
          Output: events.pr_id, (first_not_null(events.r)), (count(1))
          Sort Key: (count(1)), (first_not_null(events.r)), events.pr_id DESC
          Sort Method: quicksort  Memory: 27kB
          ->  HashAggregate  (cost=50347.14..50477.38 rows=13024 width=23) (actual time=524.311..524.349 rows=30 loops=1)
                Output: events.pr_id, (first_not_null(events.r)), count(1)
                Group Key: events.pr_id, first_not_null(events.r)
                ->  HashAggregate  (cost=46765.62..48067.99 rows=130237 width=32) (actual time=401.480..459.962 rows=116547 loops=1)
                      Output: events.pr_id, events.pa_id, first_not_null(events.r)
                      Group Key: events.pr_id, events.pa_id
                      ->  Index Only Scan using events_idx on public.events  (cost=0.56..8420.00 rows=150375 width=32) (actual time=0.027..109.459 rows=153795 loops=1)
                            Output: events.c_id, events.t, events.pr_id, events.pa_id, events.r
                            Index Cond: ((events.c_id = 5) AND (events.t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (events.t < '2017-01-06 00:00:00'::timestamp without time zone))
                            Heap Fetches: 0
Planning time: 0.296 ms
Execution time: 525.566 ms

Broadly speaking, I believe that the index above should allow query_a and query_b to be executed without the Sort nodes that slow them down, but so far I've failed to convince the postgres query optimizer to do my bidding.

I'm also somewhat confused about the t column not being included in the Sort key for query_b, considering that quicksort is not stable. It seems like this could yield the wrong results.

I've verified that all 3 queries generate the same results running the following queries and verifying they produce an empty result set:

SELECT * FROM query_a
EXCEPT
SELECT * FROM query_b;

and

SELECT * FROM query_a
EXCEPT
SELECT * FROM query_c;

I'd consider query_a to be the canonical query when in doubt.

I greatly appreciate any input on this. I've actually found a terribly hacky workaround to achieve acceptable performance in my application, but this problem continues to hunt me in my sleep (and in fact vacation, which I'm currently on) ... 😬.

FWIW, I've looked at many similar questions and answers which have guided my current thinking, but I believe there is something unique about the two column grouping (pr_id, pa_id) and having to sort by a 3rd column (t) that doesn't make this a duplicate question.

Edit: The outer queries in the example may be entirely irrelevant to the question, so feel free to ignore them if it helps.

like image 985
Felix Geisendörfer Avatar asked Apr 06 '17 07:04

Felix Geisendörfer


1 Answers

[EDITED] Ok, As this depend of your data distribution here is another way to do it.

First add the following index :

CREATE INDEX events_idx2 ON events (c_id, t DESC, pr_id, pa_id, r);

This extract the MAX(t) as quick as he can, on the assumption that the sub set will be way smaller to join on the parent table. It may however probably be slower if the dataset is not that small.

SELECT
    e.pr_id,
    e.r,
    count(1) AS quantity
FROM events e
JOIN (
    SELECT
        pr_id,
        pa_id,
        MAX(t) last_t
    FROM events e
    WHERE
        c_id = 5 
        AND t >= '2017-01-03Z00:00:00' 
        AND t < '2017-01-06Z00:00:00'
    GROUP BY 
        pr_id, 
        pa_id
) latest 
    ON (
        c_id = 5 
        AND latest.pr_id = e.pr_id
        AND latest.pa_id = e.pa_id
        AND latest.last_t = e.t
    )
GROUP BY
    e.pr_id,
    e.r
ORDER BY 3, 2, 1 DESC

Full Fiddle

SQL Fiddle

PostgreSQL 9.3 Schema Setup:

--PostgreSQL 9.6
--'\\' is a delimiter

-- CREATE TABLE events AS...

VACUUM  ANALYZE events;
CREATE INDEX idx_events_idx ON events (c_id, t DESC, pr_id, pa_id, r);

Query 1:

  -- query A
explain analyze SELECT
        pr_id,
        r,
        count(1) AS quantity
    FROM (
        SELECT DISTINCT ON (pr_id, pa_id)
          pr_id,
          pa_id,
          r
        FROM events
        WHERE
          c_id = 5 AND
          t >= '2017-01-03Z00:00:00' AND
          t < '2017-01-06Z00:00:00'
        ORDER BY pr_id, pa_id, t DESC
    ) latest
    GROUP BY
        1,
        2
    ORDER BY 3, 2, 1 DESC

Results:

QUERY PLAN
Sort  (cost=2170.24..2170.74 rows=200 width=15) (actual time=358.239..358.245 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id
Sort Method: quicksort  Memory: 27kB
->  HashAggregate  (cost=2160.60..2162.60 rows=200 width=15) (actual time=358.181..358.189 rows=30 loops=1)
->  Unique  (cost=2012.69..2132.61 rows=1599 width=40) (actual time=327.345..353.750 rows=12098 loops=1)
->  Sort  (cost=2012.69..2052.66 rows=15990 width=40) (actual time=327.344..348.686 rows=15966 loops=1)
Sort Key: events.pr_id, events.pa_id, events.t
Sort Method: external merge  Disk: 792kB
->  Index Only Scan using idx_events_idx on events  (cost=0.42..896.20 rows=15990 width=40) (actual time=0.059..5.475 rows=15966 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 358.610 ms

Query 2:

  -- query max/JOIN
explain analyze     SELECT
        e.pr_id,
        e.r,
        count(1) AS quantity
    FROM events e
    JOIN (
        SELECT
            pr_id,
            pa_id,
            MAX(t) last_t
        FROM events e
        WHERE
            c_id = 5 
            AND t >= '2017-01-03Z00:00:00' 
            AND t < '2017-01-06Z00:00:00'
        GROUP BY 
            pr_id, 
            pa_id
    ) latest 
        ON (
            c_id = 5 
            AND latest.pr_id = e.pr_id
            AND latest.pa_id = e.pa_id
            AND latest.last_t = e.t
        )
    GROUP BY
        e.pr_id,
        e.r
    ORDER BY 3, 2, 1 DESC 

Results:

QUERY PLAN
Sort  (cost=4153.31..4153.32 rows=1 width=15) (actual time=68.398..68.402 rows=30 loops=1)
Sort Key: (count(1)), e.r, e.pr_id
Sort Method: quicksort  Memory: 27kB
->  HashAggregate  (cost=4153.29..4153.30 rows=1 width=15) (actual time=68.363..68.371 rows=30 loops=1)
->  Merge Join  (cost=1133.62..4153.29 rows=1 width=15) (actual time=35.083..64.154 rows=12098 loops=1)
Merge Cond: ((e.t = (max(e_1.t))) AND (e.pr_id = e_1.pr_id))
Join Filter: (e.pa_id = e_1.pa_id)
->  Index Only Scan Backward using idx_events_idx on events e  (cost=0.42..2739.72 rows=53674 width=40) (actual time=0.010..8.073 rows=26661 loops=1)
Index Cond: (c_id = 5)
Heap Fetches: 0
->  Sort  (cost=1133.19..1137.19 rows=1599 width=36) (actual time=29.778..32.885 rows=12098 loops=1)
Sort Key: (max(e_1.t)), e_1.pr_id
Sort Method: external sort  Disk: 640kB
->  HashAggregate  (cost=1016.12..1032.11 rows=1599 width=36) (actual time=12.731..16.738 rows=12098 loops=1)
->  Index Only Scan using idx_events_idx on events e_1  (cost=0.42..896.20 rows=15990 width=36) (actual time=0.029..5.084 rows=15966 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 68.736 ms

Query 3:

DROP INDEX idx_events_idx
CREATE INDEX idx_events_flutter ON events (c_id, pr_id, pa_id, t DESC, r)

Query 5:

  -- query A + index by flutter
explain analyze SELECT
        pr_id,
        r,
        count(1) AS quantity
    FROM (
        SELECT DISTINCT ON (pr_id, pa_id)
          pr_id,
          pa_id,
          r
        FROM events
        WHERE
          c_id = 5 AND
          t >= '2017-01-03Z00:00:00' AND
          t < '2017-01-06Z00:00:00'
        ORDER BY pr_id, pa_id, t DESC
    ) latest
    GROUP BY
        1,
        2
    ORDER BY 3, 2, 1 DESC

Results:

QUERY PLAN
Sort  (cost=2744.82..2745.32 rows=200 width=15) (actual time=20.915..20.916 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id
Sort Method: quicksort  Memory: 27kB
->  HashAggregate  (cost=2735.18..2737.18 rows=200 width=15) (actual time=20.883..20.892 rows=30 loops=1)
->  Unique  (cost=0.42..2707.20 rows=1599 width=40) (actual time=0.037..16.488 rows=12098 loops=1)
->  Index Only Scan using idx_events_flutter on events  (cost=0.42..2627.25 rows=15990 width=40) (actual time=0.036..10.893 rows=15966 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 20.964 ms
like image 192
Blag Avatar answered Oct 18 '22 22:10

Blag