SELECT DISTINCT is slower than expected on my table in PostgreSQL

Tags:

Here's my table schema:

CREATE TABLE tickers (
    product_id TEXT NOT NULL,
    trade_id INT NOT NULL,
    sequence BIGINT NOT NULL,
    time TIMESTAMPTZ,
    price NUMERIC NOT NULL,
    side TEXT NOT NULL,
    last_size NUMERIC NOT NULL,
    best_bid NUMERIC NOT NULL,
    best_ask NUMERIC NOT NULL,
    PRIMARY KEY (product_id, trade_id)
);

My application subscribes to Coinbase Pro's websocket on the "ticker" channel and inserts a row into the tickers table whenever it receives a message.

The table has nearly two million rows now.

I assumed that running SELECT DISTINCT product_id FROM tickers would be fast, but it takes around 500 to 600 milliseconds. Here's the output from EXPLAIN ANALYZE:

HashAggregate  (cost=47938.97..47939.38 rows=40 width=8) (actual time=583.105..583.110 rows=40 loops=1)
  Group Key: product_id
  ->  Seq Scan on tickers  (cost=0.00..42990.98 rows=1979198 width=8) (actual time=0.030..195.536 rows=1979243 loops=1)
Planning Time: 0.068 ms
Execution Time: 583.137 ms

If I turn off seq scanning by running SET enable_seqscan = FALSE (not something I want to actually rely on, just doing it for testing purposes) then the query is a little faster. Between 400 and 500 milliseconds. Here's the output from EXPLAIN ANALYZE:

Unique  (cost=0.43..80722.61 rows=40 width=8) (actual time=0.020..480.339 rows=40 loops=1)
  ->  Index Only Scan using tickers_pkey on tickers  (cost=0.43..75772.49 rows=1980051 width=8) (actual time=0.019..344.113 rows=1980160 loops=1)
        Heap Fetches: 328693
Planning Time: 0.064 ms
Execution Time: 480.386 ms

There are only 40 unique product IDs in the table. I assumed that since product_id is part of the composite primary key, and thus indexed, SELECT DISTINCT product_id FROM tickers would be much faster. But as it turns out, the query planner defaults to using a seq scan rather than the index, and even if I force it to use the index it's still slow (but a little faster than seq scan). I realize I could create another table to store nothing but unique product IDs and query that instead, but I'm more concerned with the reason(s) why my query on the tickers table is taking so long.

EDIT #1: I tried creating an index solely on the product_id column (CREATE INDEX idx_tickers_product_id ON tickers (product_id)) and the query planner still does a sequential scan unless I run SET enable_seqscan = FALSE first. But its performance is slightly better (10 to 50 milliseconds faster) than when the composite PK index is used.

EDIT #2: I tried Erwin Brandstetter's solution and it greatly improved the speed. There are now 2.25 million rows in the table and the execution only takes 0.75 milliseconds!

EDIT #3: I wanted to augment the accepted solution in order to retrieve the ticker count (max(trade_id) - min(trade_id) + 1) as well as the min and max time for each product id. I created a new question for this: How to use index skip emulation in PostgreSQL to retrieve distinct product IDs and also min/max for certain columns

670

asked Mar 31 '21 19:03

Richard Gieg

1 Answers

While there is no index skip scan in Postgres yet, emulate it:

WITH RECURSIVE cte AS (
   (   -- parentheses required
   SELECT product_id
   FROM   tickers
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT product_id
      FROM   tickers t
      WHERE  t.product_id > c.product_id  -- lateral reference
      ORDER  BY 1
      LIMIT  1
      ) l
   )
TABLE  cte;

With an index on (product_id) and only 40 unique product IDs in the table this should be Fast. With capital F.
The PK index on (product_id, trade_id) is good for it, too!

With only very few rows per product_id (the opposite of your data distribution), DISTINCT / DISTINCT ON would be as fast or faster.

Work to implement index skip scans is ongoing.
See:

Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest row per user
Is a composite index also good for queries on the first field?

answered Oct 20 '22 18:10

Erwin Brandstetter

Related questions
                            
                                Maximum Number of Records a table variable can have in SQL Server
                            
                                SQL FORMAT function error
                            
                                SQL query that can find Typos in Arabic language
                            
                                Turn off upper-case for table and column names in HSQL?
                            
                                Combined 2 columns into one column SQL
                            
                                Add an autoincrementing ID column to an existing table with Sqlite
                            
                                dapper: Get result and count at same time using QueryMultiple
                            
                                How to update a part of a timestamp field in postgres?
                            
                                T-SQL - Insert Data into Parent and Child Tables
                            
                                ExecuteUpdate sql statement in Java not working
                            
                                Count multiple CASE occurrences in SQL
                            
                                SQL Computation of Cosine Similarity
                            
                                How to convert time without time zone to timestamp without time zone?
                            
                                SQL Server: how to use alias in update statement?
                            
                                How can I compute TF/IDF with SQL (BigQuery)
                            
                                Extracting Time from Timestamp in Redshift
                            
                                Why do these queries return different results?
                            
                                JOIN statement on an alias - SQL Server
                            
                                pg_get_serial_sequence returns nothing (empty string)
                            
                                Where does elementor stores the data from the content that is created from the frontend view?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

SELECT DISTINCT is slower than expected on my table in PostgreSQL

Tags:

sql

postgresql

database-performance

query-optimization

postgresql-performance

Richard Gieg

People also ask

1 Answers

Erwin Brandstetter

Recent Activity

Donate For Us