I'm trying to do a simple join between a table (players) and view (player_main_colors):
SELECT P.*, C.main_color FROM players P
OUTER LEFT JOIN player_main_colors C USING (player_id)
WHERE P.user_id=1;
This query is taking ~40 ms.
Here I'm using a nested SELECT on the VIEW instead of the JOIN:
SELECT player_id, main_color FROM player_main_colors
WHERE player_id IN (
SELECT player_id FROM players WHERE user_id=1);
This query is also taking ~40 ms.
When I split the query into its 2 pieces, it becomes fast, as I would have expected:
SELECT player_id FROM players WHERE user_id=1;
SELECT player_id, main_color FROM player_main_colors
where player_id in (584, 9337, 11669, 12096, 13651,
13852, 9575, 23388, 14339, 500, 24963, 25630,
8974, 13048, 11904, 10537, 20362, 9216, 4747, 25045);
These queries take ~0.5 ms each.
So why are the above queries with the JOIN or sub-SELECT so horribly slow and how can I fix it?
Here are some details about my tables and the view:
CREATE TABLE users (
user_id INTEGER PRIMARY KEY,
...
)
CREATE TABLE players (
player_id INTEGER PRIMARY KEY,
user_id INTEGER NOT NULL REFERENCES users (user_id),
...
)
CREATE TABLE player_data (
player_id INTEGER NOT NULL REFERENCES players (player_id),
game_id INTEGER NOT NULL,
color INTEGER NOT NULL,
PRIMARY KEY (player_id, game_id, color),
active_time INTEGER DEFAULT 0,
...
)
CREATE VIEW player_main_colors AS
SELECT DISTINCT ON (1) player_id, color as main_color
FROM player_data
GROUP BY player_id, color
ORDER BY 1, MAX(active_time) DESC
It seems like it must be a problem with my VIEW...?
Here's an EXPLAIN ANALYZE for the nested SELECT query above:
Merge Semi Join (cost=1877.59..2118.00 rows=6851 width=8) (actual time=32.946..38.471 rows=25 loops=1)
Merge Cond: (player_data.player_id = players.player_id)
-> Unique (cost=1733.19..1801.70 rows=13701 width=12) (actual time=32.651..37.209 rows=13419 loops=1)
-> Sort (cost=1733.19..1767.45 rows=13701 width=12) (actual time=32.646..34.918 rows=16989 loops=1)
Sort Key: player_data.player_id, (max(player_data.active_time))
Sort Method: external merge Disk: 376kB
-> HashAggregate (cost=654.79..791.80 rows=13701 width=12) (actual time=13.636..19.051 rows=17311 loops=1)
-> Seq Scan on player_data (cost=0.00..513.45 rows=18845 width=12) (actual time=0.005..1.801 rows=18845 loops=1)
-> Sort (cost=144.40..144.53 rows=54 width=8) (actual time=0.226..0.230 rows=54 loops=1)
Sort Key: players.player_id
Sort Method: quicksort Memory: 19kB
-> Bitmap Heap Scan on players (cost=4.67..142.85 rows=54 width=8) (actual time=0.035..0.112 rows=54 loops=1)
Recheck Cond: (user_id = 1)
-> Bitmap Index Scan on test (cost=0.00..4.66 rows=54 width=0) (actual time=0.023..0.023 rows=54 loops=1)
Index Cond: (user_id = 1)
Total runtime: 39.279 ms
As for indexes, I only have 1 relevant one on top of the default ones for my primary keys:
CREATE INDEX player_user_idx ON players (user_id);
I'm currently using PostgreSQL 9.2.9.
Update:
I've reduced the problem below. See the difference between IN (4747) and IN (SELECT 4747).
Slow:
>> explain analyze SELECT * FROM (
SELECT player_id, color
FROM player_data
GROUP BY player_id, color
ORDER BY MAX(active_time) DESC
) S
WHERE player_id IN (SELECT 4747);
Hash Join (cost=1749.99..1975.37 rows=6914 width=8) (actual time=30.492..34.291 rows=4 loops=1)
Hash Cond: (player_data.player_id = (4747))
-> Sort (cost=1749.95..1784.51 rows=13827 width=12) (actual time=30.391..32.655 rows=17464 loops=1)
Sort Key: (max(player_data.active_time))
Sort Method: external merge Disk: 376kB
-> HashAggregate (cost=660.71..798.98 rows=13827 width=12) (actual time=12.714..17.249 rows=17464 loops=1)
-> Seq Scan on player_data (cost=0.00..518.12 rows=19012 width=12) (actual time=0.006..1.898 rows=19012 loops=1)
-> Hash (cost=0.03..0.03 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> HashAggregate (cost=0.02..0.03 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
Total runtime: 35.015 ms
(12 rows)
Time: 35.617 ms
Fast:
>> explain analyze SELECT * FROM (
SELECT player_id, color
FROM player_data
GROUP BY player_id, color
ORDER BY MAX(active_time) DESC
) S
WHERE player_id IN (4747);
Subquery Scan on s (cost=17.40..17.45 rows=4 width=8) (actual time=0.035..0.035 rows=4 loops=1)
-> Sort (cost=17.40..17.41 rows=4 width=12) (actual time=0.034..0.034 rows=4 loops=1)
Sort Key: (max(player_data.active_time))
Sort Method: quicksort Memory: 17kB
-> GroupAggregate (cost=0.00..17.36 rows=4 width=12) (actual time=0.020..0.027 rows=4 loops=1)
-> Index Scan using player_data_pkey on player_data (cost=0.00..17.28 rows=5 width=12) (actual time=0.014..0.019 rows=5 loops=1)
Index Cond: (player_id = 4747)
Total runtime: 0.080 ms
(8 rows)
Time: 0.610 ms
You have both GROUP BY and DISTINCT ON in your VIEW definition. That's like shooting a dead man. Simplify:
CREATE VIEW player_main_colors AS
SELECT DISTINCT ON (1)
player_id, color AS main_color
FROM player_data
ORDER BY 1, active_time DESC NULLS LAST;
NULLS LAST is necessary to be equivalent to your original, because active_time can be NULL according to your table definition. Should be faster. But there is more. For best performance create these indexes:
CREATE INDEX players_up_idx ON players (user_id, player_id);
CREATE INDEX players_data_pa_idx ON player_data
(player_id, active_time DESC NULLS LAST, color);
Use DESC NULLS LAST in the index as well to match the sort order of the query. Or you alter player_data.active_time to NOT NULL and simplify all.
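If you go the NOT NULL route, it might look something like this (just a sketch; backfilling NULLs with 0 matches the column's DEFAULT 0, but whether that is the right interpretation of your data is an assumption):
-- Assumption: existing NULLs in active_time can be treated as 0 (the column default).
UPDATE player_data SET active_time = 0 WHERE active_time IS NULL;
ALTER TABLE player_data ALTER COLUMN active_time SET NOT NULL;

-- With NOT NULL enforced, NULLS LAST can be dropped from the view and the index:
DROP INDEX IF EXISTS players_data_pa_idx;
CREATE INDEX players_data_pa_idx ON player_data (player_id, active_time DESC, color);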
BTW, it's LEFT OUTER JOIN, not OUTER LEFT JOIN. Or just omit the noise word OUTER:
SELECT * -- equivalent here to "p.*, c.main_color"
FROM players p
LEFT JOIN player_main_colors c USING (player_id)
WHERE p.user_id = 1;
I'll assume there are lots of rows in player_data for each player_id. And you are only selecting a few player_id. JOIN LATERAL would be fastest for this case, but you need Postgres 9.3 or later for that. In pg 9.2 you can achieve a similar effect with correlated subqueries:
CREATE VIEW player_main_colors AS
SELECT player_id
, (SELECT color
FROM player_data
WHERE player_id = p.player_id
ORDER BY active_time DESC NULLS LAST
LIMIT 1) AS main_color
FROM players p
ORDER BY 1 -- optional
Subtle difference to your original view: this includes players without any entries in player_data. You could try the same query as above based on the new view. But I would not use a view at all. This is probably fastest:
SELECT *
, (SELECT color
FROM player_data
WHERE player_id = p.player_id
ORDER BY active_time DESC NULLS LAST
LIMIT 1) AS main_color
FROM players p
WHERE p.user_id = 1;
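For reference, once you're on Postgres 9.3 or later, the JOIN LATERAL variant mentioned above could look roughly like this (a sketch along the same lines, not tuned against your data):
SELECT p.*, c.main_color
FROM   players p
LEFT   JOIN LATERAL (
   SELECT d.color AS main_color
   FROM   player_data d
   WHERE  d.player_id = p.player_id
   ORDER  BY d.active_time DESC NULLS LAST
   LIMIT  1
   ) c ON true
WHERE  p.user_id = 1;
Like the correlated subquery, it returns one top-ranked color per player, keeps players that have no rows in player_data, and should be able to use the index on (player_id, active_time DESC NULLS LAST, color) suggested above.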