
Efficient way to combine SELECT DISTINCT, ORDER BY, and a small LIMIT

My table is:

create table transactions (
    transaction_id  integer  not null,
    transaction_timestamp integer  not null,
    input_index     smallint,
    output_index    smallint not null,
    from_id         integer,
    to_id           integer  not null,
    input_value     real,
    output_value    real     not null,
    constraint unique_transactions
        unique (transaction_id, from_id, to_id)
);

with the following indexes:

create index idx_transactions_from_id_block_timestamp
    on transactions (from_id asc, transaction_timestamp desc);

create index idx_transactions_to_id_block_timestamp
    on transactions (to_id asc, transaction_timestamp desc);

create index idx_transactions_transaction_id
    on transactions (transaction_id);

create index idx_transactions_block_timestamp
    on transactions (transaction_timestamp desc);

The query I would ideally like to run is:

select distinct on (transaction_id,output_index) *
from transactions
where to_id = 1000
and transaction_timestamp between 1691193600 AND 1711929600
order by transaction_timestamp desc
limit 10;

This should give me the most recent 10 unique (transaction_id, output_index) pairs (I don't care which from_id and input_index survive for each pair).

This straightforward version doesn't work, since Postgres requires the ORDER BY to start with the DISTINCT ON columns: ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

Complying with that reorders my rows, so the first 10 returned are the ones with the highest transaction_id, which is not what I want.

Is there an efficient way to do this that exploits the low LIMIT, so the query doesn't have to scan the millions of rows in the table?

I have attempted the following queries; all took too long because they had to process the whole qualifying set before applying the small LIMIT 10.

Query 1:

WITH RankedTransactions AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY transaction_id, output_index ORDER BY transaction_timestamp DESC) AS rn
    FROM transactions
    WHERE to_id = 1000
      AND transaction_timestamp BETWEEN 1691193600 AND 1711929600
)
SELECT transaction_id,
       input_index,
       output_index,
       transaction_timestamp,
       from_id,
       to_id,
       input_value,
       output_value
FROM RankedTransactions
WHERE rn = 1
ORDER BY transaction_timestamp DESC
LIMIT 10;

Query 2:

SELECT *
FROM (
    SELECT DISTINCT ON (transaction_id, output_index) *
    FROM transactions
    WHERE to_id = 1000
    and transaction_timestamp between 1691193600 AND 1711929600
    ORDER BY transaction_id, output_index DESC
) AS latest_transactions
ORDER BY transaction_timestamp DESC
LIMIT 10;
Asked by Ido Shimshi on Oct 26 '25.

1 Answer

This works:

SELECT *
FROM  (
   SELECT DISTINCT ON (transaction_id, output_index) *
   FROM   transactions
   WHERE  to_id = 1000
   AND    transaction_timestamp BETWEEN 1691193600 AND 1711929600
   ORDER  BY transaction_id, output_index, transaction_timestamp DESC  -- !!!
   ) AS latest_transactions
ORDER  BY transaction_timestamp DESC
LIMIT  10;

I removed the distracting DESC from output_index in the sort order.

The best query (and index) depend on how many rows (and duplicates within those) to expect per basic selection (after filtering by WHERE to_id = 1000 AND transaction_timestamp BETWEEN 1691193600 AND 1711929600).

If there are not many of either, excess rows can be filtered cheaply, and this query is just fine when backed by your existing index on (to_id, transaction_timestamp DESC).

With a large number of qualifying rows and/or a large number of duplicates on (transaction_id, output_index), things get more complicated.
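The dedupe-then-reorder idea (keep only the newest row per (transaction_id, output_index) pair, then re-sort the survivors by timestamp and apply the small LIMIT) is portable. As a rough illustration on hypothetical toy data — using SQLite via Python's sqlite3 as a stand-in for Postgres, with ROW_NUMBER() replacing DISTINCT ON, since SQLite has no DISTINCT ON:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        transaction_id INTEGER, output_index INTEGER,
        to_id INTEGER, transaction_timestamp INTEGER)""")
rows = [
    (1, 0, 1000, 100),   # older duplicate of pair (1, 0)
    (1, 0, 1000, 200),   # newest row for pair (1, 0)
    (2, 1, 1000, 150),
    (3, 0, 999,  300),   # wrong to_id, filtered out
]
conn.executemany("INSERT INTO transactions VALUES (?,?,?,?)", rows)

# Step 1 (inner query): keep only the newest row per pair.
# Step 2 (outer query): order survivors by timestamp, apply the LIMIT.
latest = conn.execute("""
    SELECT transaction_id, output_index, transaction_timestamp
    FROM  (SELECT *, ROW_NUMBER() OVER (
               PARTITION BY transaction_id, output_index
               ORDER BY transaction_timestamp DESC) AS rn
           FROM transactions
           WHERE to_id = 1000) t
    WHERE rn = 1
    ORDER BY transaction_timestamp DESC
    LIMIT 10""").fetchall()
print(latest)  # [(1, 0, 200), (2, 1, 150)]
```

Note that this two-step form, like Query 1 above, still materializes every qualifying pair before the LIMIT can cut in; that is exactly the cost the recursive approach below avoids.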

Related:

  • PostgreSQL DISTINCT ON with different ORDER BY
  • Select first row in each GROUP BY group?
  • Optimize GROUP BY query to retrieve latest row per user

Recursive query

If there are many qualifying rows, but not very many duplicates on (transaction_id, output_index) within that set, and your LIMIT is very small (like LIMIT 10), this recursive query should be very fast.

I collect the (few!) selected combinations of (transaction_id, output_index) in an array to use that as filter going forward. Create a matching composite type for the purpose once:

CREATE TYPE trans_out AS (transaction_id integer, output_index smallint);

Then:

WITH RECURSIVE cte AS (
   (                                -- parentheses required
   SELECT t AS my_row, ARRAY[(transaction_id, output_index)::trans_out] AS selected
   FROM   transactions t
   WHERE  to_id = 1000
   AND    transaction_timestamp >= 1691193600
   AND    transaction_timestamp <= 1711929600
   ORDER  BY transaction_timestamp DESC
   LIMIT  1
   )
   UNION ALL
   SELECT t.my_row, c.selected || t.sel1
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT t AS my_row, (t.transaction_id, t.output_index)::trans_out AS sel1
      FROM   transactions t
      WHERE  t.to_id = 1000
      AND    t.transaction_timestamp >= 1691193600
      AND    t.transaction_timestamp <= (c.my_row).transaction_timestamp  -- parentheses required!
      AND   (t.transaction_id, t.output_index)::trans_out <> ALL (c.selected)
      ORDER  BY transaction_timestamp DESC
      LIMIT  1
      ) t
   )
SELECT (my_row).*  -- parentheses required!
FROM   cte;

This still only needs your existing index on (to_id, transaction_timestamp DESC) (where DESC does not matter).
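Procedurally, the recursive CTE amounts to the following loop: walk rows newest-first (as the index scan would), keep a row only if its (transaction_id, output_index) pair hasn't been seen yet, and stop as soon as the LIMIT is reached. A minimal pure-Python sketch over in-memory rows (hypothetical sample data):

```python
# Each row: (transaction_id, output_index, to_id, transaction_timestamp)
rows = [
    (1, 0, 1000, 500),
    (1, 0, 1000, 400),  # duplicate pair (1, 0), older -> skipped
    (2, 1, 1000, 450),
    (3, 0, 999,  600),  # wrong to_id -> filtered out
]

def latest_unique_pairs(rows, to_id, ts_lo, ts_hi, limit):
    """Newest-first scan keeping only rows whose
    (transaction_id, output_index) pair is new."""
    # Emulates the index scan on (to_id, transaction_timestamp DESC).
    candidates = sorted(
        (r for r in rows if r[2] == to_id and ts_lo <= r[3] <= ts_hi),
        key=lambda r: r[3], reverse=True)
    seen, out = set(), []
    for r in candidates:
        pair = (r[0], r[1])
        if pair in seen:        # the "<> ALL (selected)" filter
            continue
        seen.add(pair)
        out.append(r)
        if len(out) == limit:   # stop early thanks to the small LIMIT
            break
    return out

print(latest_unique_pairs(rows, 1000, 0, 1000, 10))
# [(1, 0, 1000, 500), (2, 1, 1000, 450)]
```

The early break is the whole point: with LIMIT 10 and few duplicates, only a handful of rows near the top of the index ever get touched.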

Answered by Erwin Brandstetter on Oct 29 '25.