Optimize GROUP BY query to retrieve latest row per user

Tags: sql, indexing, postgresql, greatest-n-per-group, postgresql-performance

I have the following log table for user messages (simplified form) in Postgres 9.2:

CREATE TABLE log (
    log_date DATE,
    user_id  INTEGER,
    payload  INTEGER
);

It contains up to one record per user and per day. There will be approximately 500K records per day for 300 days. payload is ever increasing for each user (if that matters).

I want to efficiently retrieve the latest record for each user before a specific date. My query is:

SELECT user_id, max(log_date), max(payload) 
FROM log 
WHERE log_date <= :mydate 
GROUP BY user_id

which is extremely slow. I have also tried:

SELECT DISTINCT ON (user_id) user_id, log_date, payload
FROM log
WHERE log_date <= :mydate
ORDER BY user_id, log_date DESC;

which has the same plan and is equally slow.

So far I have a single index on log(log_date), but it doesn't help much.

And I have a users table with all users included. I also want to retrieve the result for some users (those with payload > :value).

Is there any other index I should use to speed this up, or any other way to achieve what I want?


asked Aug 27 '14 20:08

xpapad

3 Answers

For best read performance you need a multicolumn index:

CREATE INDEX log_combo_idx
ON log (user_id, log_date DESC NULLS LAST);

To make index only scans possible, add the otherwise not needed column payload in a covering index with the INCLUDE clause (Postgres 11 or later):

CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST) INCLUDE (payload);

See:

Do covering indexes in PostgreSQL help JOIN columns?

Fallback for older versions:

CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST, payload);

Why DESC NULLS LAST?

Unused index in range of dates query

For few rows per user_id or small tables DISTINCT ON is typically fastest and simplest:

Select first row in each GROUP BY group?
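For reference, a minimal sketch of the DISTINCT ON variant for this table. It assumes the multicolumn index from above exists, and `:mydate` is the same placeholder used in the question. The ORDER BY is written to match the index definition, including NULLS LAST.

```sql
-- Latest row per user on or before :mydate.
-- DISTINCT ON keeps the first row of each user_id group in ORDER BY order.
SELECT DISTINCT ON (user_id)
       user_id, log_date, payload
FROM   log
WHERE  log_date <= :mydate
ORDER  BY user_id, log_date DESC NULLS LAST;
```

The leading ORDER BY expressions have to match the DISTINCT ON expressions, which is why user_id comes first.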

For many rows per user_id an index skip scan (or loose index scan) is (much) more efficient. That's not implemented up to Postgres 12 - work is ongoing for Postgres 14. But there are ways to emulate it efficiently.

Common Table Expressions require Postgres 8.4+.
LATERAL requires Postgres 9.3+.
The following solutions go beyond what's covered in the Postgres Wiki.

1. No separate table with unique users

With a separate users table, solutions in 2. below are typically simpler and faster. Skip ahead.

1a. Recursive CTE with `LATERAL` join

WITH RECURSIVE cte AS (
   (                                -- parentheses required
   SELECT user_id, log_date, payload
   FROM   log
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT l.user_id, l.log_date, l.payload
      FROM   log l
      WHERE  l.user_id > c.user_id  -- lateral reference
      AND    log_date <= :mydate    -- repeat condition
      ORDER  BY l.user_id, l.log_date DESC NULLS LAST
      LIMIT  1
      ) l
   )
TABLE  cte
ORDER  BY user_id;

This makes it simple to retrieve arbitrary columns and is probably best in current Postgres. More explanation in chapter 2a. below.

1b. Recursive CTE with correlated subquery

WITH RECURSIVE cte AS (
   (                                           -- parentheses required
   SELECT l AS my_row                          -- whole row
   FROM   log l
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT l                            -- whole row
           FROM   log l
           WHERE  l.user_id > (c.my_row).user_id
           AND    l.log_date <= :mydate        -- repeat condition
           ORDER  BY l.user_id, l.log_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.my_row).user_id IS NOT NULL       -- note parentheses
   )
SELECT (my_row).*                              -- decompose row
FROM   cte
WHERE  (my_row).user_id IS NOT NULL
ORDER  BY (my_row).user_id;

Convenient to retrieve a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.

To assert a row was found in the previous iteration, test a single NOT NULL column (like the primary key).

More explanation for this query in chapter 2b. below.

Query last N related rows per row
GROUP BY one column, while sorting by another in PostgreSQL

2. With separate `users` table

Table layout hardly matters as long as exactly one row per relevant user_id is guaranteed. Example:

CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);

Ideally, the table is physically sorted in sync with the log table. See:

Optimize Postgres timestamp query range

Or it's small enough (low cardinality) that it hardly matters. Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition. If the physical sort order of the users table happens to match the index on log, this may be irrelevant.
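If physically sorting both tables is an option, a one-off rewrite with CLUSTER is one way to get there. The following is only a sketch: it assumes the index log_combo_idx from above exists and that the primary key index on users carries the default name users_pkey (an assumption, check your schema). CLUSTER takes an exclusive lock while it runs, and the order is not maintained for rows written afterwards.

```sql
-- Rewrite users in user_id order (index name users_pkey is assumed).
CLUSTER users USING users_pkey;

-- Rewrite log in (user_id, log_date DESC NULLS LAST) order.
CLUSTER log USING log_combo_idx;

-- Refresh planner statistics after the rewrite.
ANALYZE users;
ANALYZE log;
```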

2a. `LATERAL` join

SELECT u.user_id, l.log_date, l.payload
FROM   users u
CROSS  JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id         -- lateral reference
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l;

JOIN LATERAL allows referencing preceding FROM items on the same query level. See:

What is the difference between LATERAL JOIN and a subquery in PostgreSQL?

Results in one index (-only) look-up per user.

Returns no row for users missing in the users table. Typically, a foreign key constraint enforcing referential integrity would rule that out.

Also, no row for users without matching entry in log - conforming to the original question. To keep those users in the result use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL:

Call a set-returning function with an array argument multiple times
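A sketch of the LEFT JOIN LATERAL variant, assuming the same log and users tables as above. Users without a matching log row come back with NULL for log_date and payload.

```sql
SELECT u.user_id, l.log_date, l.payload
FROM   users u
LEFT   JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id         -- lateral reference
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l ON true;                          -- join condition is required for LEFT JOIN
```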

Use LIMIT n instead of LIMIT 1 to retrieve more than one row (but not all) per user.

Effectively, all of these do the same:

JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...

The last one has lower priority, though. Explicit JOIN binds before comma. That subtle difference can matter with more join tables. See:

"invalid reference to FROM-clause entry for table" in Postgres query

2b. Correlated subquery

Good choice to retrieve a single column from a single row. Code example:

Optimize groupwise maximum query
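For this question, the single-column case could look roughly like the following sketch, which fetches only the latest payload per user. It assumes the users table and the multicolumn index from above. Unlike LATERAL, it also runs on Postgres 9.2.

```sql
SELECT u.user_id
     , (SELECT l.payload
        FROM   log l
        WHERE  l.user_id = u.user_id
        AND    l.log_date <= :mydate
        ORDER  BY l.log_date DESC NULLS LAST
        LIMIT  1) AS payload             -- NULL if the user has no matching row
FROM   users u;
```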

The same is possible for multiple columns, but you need more smarts:

CREATE TEMP TABLE combo (log_date date, payload int);

SELECT user_id, (combo1).*              -- note parentheses
FROM (
   SELECT u.user_id
        , (SELECT (l.log_date, l.payload)::combo
           FROM   log l
           WHERE  l.user_id = u.user_id
           AND    l.log_date <= :mydate
           ORDER  BY l.log_date DESC NULLS LAST
           LIMIT  1) AS combo1
   FROM   users u
   ) sub;

Like LEFT JOIN LATERAL above, this variant includes all users, even without entries in log. You get NULL for combo1, which you can easily filter with a WHERE clause in the outer query if need be.
Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all column values happen to be NULL - same result. You need a NOT NULL column in the subquery to avoid this ambiguity.

A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.
Use a registered type like the row type of an existing table. Or register a composite type explicitly (and permanently) with CREATE TYPE. Or create a temporary table (dropped automatically at end of session) to register its row type temporarily. Cast syntax: (log_date, payload)::combo
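As an alternative to the temporary table, a permanent composite type can be registered with CREATE TYPE. A minimal sketch, where the type name log_combo is hypothetical and chosen so it does not collide with the temporary table combo used above:

```sql
-- Register a permanent composite type (name log_combo is an assumption).
CREATE TYPE log_combo AS (log_date date, payload int);

SELECT user_id, (combo1).*              -- decompose in the outer query
FROM (
   SELECT u.user_id
        , (SELECT (l.log_date, l.payload)::log_combo
           FROM   log l
           WHERE  l.user_id = u.user_id
           AND    l.log_date <= :mydate
           ORDER  BY l.log_date DESC NULLS LAST
           LIMIT  1) AS combo1
   FROM   users u
   ) sub;
```

The type stays in the schema until it is removed with DROP TYPE log_combo.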

Finally, we do not want to decompose combo1 on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (still true in Postgres 12). Instead, make it a subquery and decompose in the outer query.

Get values from first and last row per group

Demonstrating all 4 queries with 100k log entries and 1k users:
db<>fiddle here - pg 11 (old sqlfiddle)

answered Oct 17 '22 18:10

Erwin Brandstetter

This is not a standalone answer but rather a comment to @Erwin's answer. For 2a, the lateral join example, the query can be improved by sorting the users table to exploit the locality of the index on log.

SELECT u.user_id, l.log_date, l.payload
  FROM (SELECT user_id FROM users ORDER BY user_id) u,
       LATERAL (SELECT log_date, payload
                  FROM log
                 WHERE user_id = u.user_id -- lateral reference
                   AND log_date <= :mydate
              ORDER BY log_date DESC NULLS LAST
                 LIMIT 1) l;

The rationale is that index lookups are expensive if the user_id values arrive in random order. By sorting on user_id first, the subsequent lateral join behaves much like a simple scan of the index on log. Even though both query plans look alike, the running times can differ considerably, especially for large tables.

The cost of the sorting is minimal especially if there is an index on the user_id field.
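One way to check the effect on your own data is to run both variants under EXPLAIN with the ANALYZE and BUFFERS options and compare run time and buffer counts. A sketch for the sorted variant; repeat it with plain `users u` in place of the sorted subquery for the comparison. Note that ANALYZE actually executes the query.

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.user_id, l.log_date, l.payload
  FROM (SELECT user_id FROM users ORDER BY user_id) u,
       LATERAL (SELECT log_date, payload
                  FROM log
                 WHERE user_id = u.user_id -- lateral reference
                   AND log_date <= :mydate
              ORDER BY log_date DESC NULLS LAST
                 LIMIT 1) l;
```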

answered Oct 17 '22 17:10

Gang Liang

Perhaps a different index on the table would help. Try this one: log(user_id, log_date). I am not positive that Postgres will make optimal use of it with DISTINCT ON.
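Spelled out as a statement, that index could be created like this (the index name log_user_date_idx is hypothetical):

```sql
-- Multicolumn index: equality on user_id first, then range on log_date.
CREATE INDEX log_user_date_idx ON log (user_id, log_date);
```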

So, I would stick with that index and try this version:

select *
from log l
where l.log_date <= :mydate and   -- restrict the outer rows as well
      not exists (select 1
                  from log l2
                  where l2.user_id = l.user_id and
                        l2.log_date <= :mydate and
                        l2.log_date > l.log_date
                 );

This should replace the sorting/grouping with index lookups. It might be faster.

answered Oct 17 '22 17:10

Gordon Linoff
