
SELECT DISTINCT is slower than expected on my table in PostgreSQL

Here's my table schema:

CREATE TABLE tickers (
    product_id TEXT NOT NULL,
    trade_id INT NOT NULL,
    sequence BIGINT NOT NULL,
    time TIMESTAMPTZ,
    price NUMERIC NOT NULL,
    side TEXT NOT NULL,
    last_size NUMERIC NOT NULL,
    best_bid NUMERIC NOT NULL,
    best_ask NUMERIC NOT NULL,
    PRIMARY KEY (product_id, trade_id)
);

My application subscribes to Coinbase Pro's websocket on the "ticker" channel and inserts a row into the tickers table whenever it receives a message.
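
For reference, each message turns into a single-row insert roughly like this (the values are placeholders for illustration, not the actual payload):

INSERT INTO tickers (product_id, trade_id, sequence, time, price, side, last_size, best_bid, best_ask)
VALUES ('BTC-USD', 123456, 987654321, now(), 50000.00, 'buy', 0.01, 49999.99, 50000.01);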

The table has nearly two million rows now.

I assumed that running SELECT DISTINCT product_id FROM tickers would be fast, but it takes around 500 to 600 milliseconds. Here's the output from EXPLAIN ANALYZE:

HashAggregate  (cost=47938.97..47939.38 rows=40 width=8) (actual time=583.105..583.110 rows=40 loops=1)
  Group Key: product_id
  ->  Seq Scan on tickers  (cost=0.00..42990.98 rows=1979198 width=8) (actual time=0.030..195.536 rows=1979243 loops=1)
Planning Time: 0.068 ms
Execution Time: 583.137 ms
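
(For reference, the plan above is simply the result of running the query under EXPLAIN ANALYZE:)

EXPLAIN ANALYZE SELECT DISTINCT product_id FROM tickers;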

If I turn off sequential scanning by running SET enable_seqscan = FALSE (not something I want to rely on in production, just for testing purposes), the query is a little faster, between 400 and 500 milliseconds. Here's the output from EXPLAIN ANALYZE:

Unique  (cost=0.43..80722.61 rows=40 width=8) (actual time=0.020..480.339 rows=40 loops=1)
  ->  Index Only Scan using tickers_pkey on tickers  (cost=0.43..75772.49 rows=1980051 width=8) (actual time=0.019..344.113 rows=1980160 loops=1)
        Heap Fetches: 328693
Planning Time: 0.064 ms
Execution Time: 480.386 ms
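
For completeness, the test sequence looks roughly like this (the RESET at the end is just to put the planner back to its default behavior):

SET enable_seqscan = FALSE;  -- testing only, not something to rely on
EXPLAIN ANALYZE SELECT DISTINCT product_id FROM tickers;
RESET enable_seqscan;        -- restore default planner behavior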

There are only 40 unique product IDs in the table. I assumed that since product_id is part of the composite primary key, and thus indexed, SELECT DISTINCT product_id FROM tickers would be much faster. But as it turns out, the query planner defaults to using a seq scan rather than the index, and even if I force it to use the index it's still slow (but a little faster than seq scan). I realize I could create another table to store nothing but unique product IDs and query that instead, but I'm more concerned with the reason(s) why my query on the tickers table is taking so long.
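
The workaround I'm referring to would look something like the following sketch (just to be concrete about what I mean; the products table name is made up, and it's not what I actually want to do):

CREATE TABLE products (
    product_id TEXT PRIMARY KEY
);

-- run alongside each ticker insert; a no-op when the product is already known
INSERT INTO products (product_id)
VALUES ('BTC-USD')
ON CONFLICT (product_id) DO NOTHING;

-- the distinct list then becomes a trivial lookup
SELECT product_id FROM products;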

EDIT #1: I tried creating an index solely on the product_id column (CREATE INDEX idx_tickers_product_id ON tickers (product_id)) and the query planner still does a sequential scan unless I run SET enable_seqscan = FALSE first. But its performance is slightly better (10 to 50 milliseconds faster) than when the composite PK index is used.

EDIT #2: I tried Erwin Brandstetter's solution and it greatly improved the speed. There are now 2.25 million rows in the table and the execution only takes 0.75 milliseconds!

EDIT #3: I wanted to augment the accepted solution in order to retrieve the ticker count (max(trade_id) - min(trade_id) + 1) as well as the min and max time for each product id. I created a new question for this: How to use index skip emulation in PostgreSQL to retrieve distinct product IDs and also min/max for certain columns
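
For clarity, written as a plain aggregate (the slow, straightforward form), the query I want to augment would boil down to something like this:

SELECT product_id,
       max(trade_id) - min(trade_id) + 1 AS ticker_count,
       min(time) AS min_time,
       max(time) AS max_time
FROM   tickers
GROUP  BY product_id;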

Asked Mar 31 '21 by Richard Gieg


1 Answer

While there is no index skip scan in Postgres yet, emulate it:

WITH RECURSIVE cte AS (
   (   -- parentheses required
   SELECT product_id
   FROM   tickers
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT product_id
      FROM   tickers t
      WHERE  t.product_id > c.product_id  -- lateral reference
      ORDER  BY 1
      LIMIT  1
      ) l
   )
TABLE  cte;

With an index on (product_id) and only 40 unique product IDs in the table this should be Fast. With capital F.
The PK index on (product_id, trade_id) is good for it, too!
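
If the plain index isn't in place yet, creating it is a one-liner (same as the one from EDIT #1 of the question):

CREATE INDEX idx_tickers_product_id ON tickers (product_id);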

With only very few rows per product_id (the opposite of your data distribution), DISTINCT / DISTINCT ON would be as fast or faster.
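
For illustration, a DISTINCT ON variant, here picking the latest row per product_id (just one possible choice of which row to keep):

SELECT DISTINCT ON (product_id) *
FROM   tickers
ORDER  BY product_id, time DESC;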

Work to implement index skip scans is ongoing.
See:

  • Select first row in each GROUP BY group?
  • Optimize GROUP BY query to retrieve latest row per user
  • Is a composite index also good for queries on the first field?
Answered Oct 20 '22 by Erwin Brandstetter