Here's my table schema:
CREATE TABLE tickers (
product_id TEXT NOT NULL,
trade_id INT NOT NULL,
sequence BIGINT NOT NULL,
time TIMESTAMPTZ,
price NUMERIC NOT NULL,
side TEXT NOT NULL,
last_size NUMERIC NOT NULL,
best_bid NUMERIC NOT NULL,
best_ask NUMERIC NOT NULL,
PRIMARY KEY (product_id, trade_id)
);
My application subscribes to Coinbase Pro's websocket on the "ticker" channel and inserts a row into the tickers table whenever it receives a message.
The table has nearly two million rows now.
I assumed that running SELECT DISTINCT product_id FROM tickers
would be fast, but it takes around 500 to 600 milliseconds. Here's the output from EXPLAIN ANALYZE
:
HashAggregate (cost=47938.97..47939.38 rows=40 width=8) (actual time=583.105..583.110 rows=40 loops=1)
Group Key: product_id
-> Seq Scan on tickers (cost=0.00..42990.98 rows=1979198 width=8) (actual time=0.030..195.536 rows=1979243 loops=1)
Planning Time: 0.068 ms
Execution Time: 583.137 ms
If I turn off seq scanning by running SET enable_seqscan = FALSE
(not something I want to actually rely on, just doing it for testing purposes) then the query is a little faster. Between 400 and 500 milliseconds. Here's the output from EXPLAIN ANALYZE
:
Unique (cost=0.43..80722.61 rows=40 width=8) (actual time=0.020..480.339 rows=40 loops=1)
-> Index Only Scan using tickers_pkey on tickers (cost=0.43..75772.49 rows=1980051 width=8) (actual time=0.019..344.113 rows=1980160 loops=1)
Heap Fetches: 328693
Planning Time: 0.064 ms
Execution Time: 480.386 ms
There are only 40 unique product IDs in the table. I assumed that since product_id
is part of the composite primary key, and thus indexed, SELECT DISTINCT product_id FROM tickers
would be much faster. But as it turns out, the query planner defaults to using a seq scan rather than the index, and even if I force it to use the index it's still slow (but a little faster than seq scan). I realize I could create another table to store nothing but unique product IDs and query that instead, but I'm more concerned with the reason(s) why my query on the tickers table is taking so long.
EDIT #1:
I tried creating an index solely on the product_id column (CREATE INDEX idx_tickers_product_id ON tickers (product_id)
) and the query planner still does a sequential scan unless I run SET enable_seqscan = FALSE
first. But its performance is slightly better (10 to 50 milliseconds faster) than when the composite PK index is used.
EDIT #2: I tried Erwin Brandstetter's solution and it greatly improved the speed. There are now 2.25 million rows in the table and the execution only takes 0.75 milliseconds!
EDIT #3: I wanted to augment the accepted solution in order to retrieve the ticker count (max(trade_id) - min(trade_id) + 1) as well as the min and max time for each product id. I created a new question for this: How to use index skip emulation in PostgreSQL to retrieve distinct product IDs and also min/max for certain columns
You probably don't want to hear this, but the best option to speed up SELECT DISTINCT is to avoid DISTINCT to begin with. In many cases (not all!) it can be avoided with better database-design or better queries. Sometimes, GROUP BY is faster, because it takes a different code path.
It's slow because the database is iterating over all the logs and all the dashboards, then joining them, then sorting them, all before getting down to real work of grouping and aggregating.
Very few queries may perform faster in SELECT DISTINCT mode, and very few will perform slower (but not significantly slower) in SELECT DISTINCT mode but for the later case it is likely that the application may need to examine the duplicate cases, which shifts the performance and complexity burden to the application.
After running the test several times, I found that GROUP BY was consistently about 5 - 10% faster than SELECT DISTINCT: The only test that did not bear this out was when I ran each query only once.
While there is no index skip scan in Postgres yet, emulate it:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT product_id
FROM tickers
ORDER BY 1
LIMIT 1
)
UNION ALL
SELECT l.*
FROM cte c
CROSS JOIN LATERAL (
SELECT product_id
FROM tickers t
WHERE t.product_id > c.product_id -- lateral reference
ORDER BY 1
LIMIT 1
) l
)
TABLE cte;
With an index on (product_id)
and only 40 unique product IDs in the table this should be Fast. With capital F.
The PK index on (product_id, trade_id)
is good for it, too!
With only very few rows per product_id
(the opposite of your data distribution), DISTINCT
/ DISTINCT ON
would be as fast or faster.
Work to implement index skip scans is ongoing.
See:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With