The query:
SELECT "replays_game".* FROM "replays_game" INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id" WHERE "replays_playeringame"."player_id" = 50027
If I run SET enable_seqscan = off, then it does the fast thing, which is:
                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..27349.80 rows=3395 width=72) (actual time=28.726..65.056 rows=3398 loops=1)
   ->  Index Scan using replays_playeringame_player_id on replays_playeringame  (cost=0.00..8934.43 rows=3395 width=4) (actual time=0.019..2.412 rows=3398 loops=1)
         Index Cond: (player_id = 50027)
   ->  Index Scan using replays_game_pkey on replays_game  (cost=0.00..5.41 rows=1 width=72) (actual time=0.017..0.017 rows=1 loops=3398)
         Index Cond: (id = replays_playeringame.game_id)
 Total runtime: 65.437 ms
But without the dreaded enable_seqscan override, it chooses to do a slower thing:
                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=7330.18..18145.24 rows=3395 width=72) (actual time=92.380..535.422 rows=3398 loops=1)
   Hash Cond: (replays_playeringame.game_id = replays_game.id)
   ->  Index Scan using replays_playeringame_player_id on replays_playeringame  (cost=0.00..8934.43 rows=3395 width=4) (actual time=0.020..2.899 rows=3398 loops=1)
         Index Cond: (player_id = 50027)
   ->  Hash  (cost=3668.08..3668.08 rows=151208 width=72) (actual time=90.842..90.842 rows=151208 loops=1)
         Buckets: 1024  Batches: 32 (originally 16)  Memory Usage: 1025kB
         ->  Seq Scan on replays_game  (cost=0.00..3668.08 rows=151208 width=72) (actual time=0.020..29.061 rows=151208 loops=1)
 Total runtime: 535.821 ms
Here are the relevant indexes:
Index "public.replays_game_pkey" Column | Type | Definition --------+---------+------------ id | integer | id primary key, btree, for table "public.replays_game" Index "public.replays_playeringame_player_id" Column | Type | Definition -----------+---------+------------ player_id | integer | player_id btree, for table "public.replays_playeringame"
So my question is, what am I doing wrong that Postgres is mis-estimating the relative costs of the two ways of joining? I see in the cost estimates that it thinks the hash-join will be faster. And its estimate of the cost of the index-join is off by a factor of 500.
How can I give Postgres more of a clue? I did run a VACUUM ANALYZE immediately before running all of the above.
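(For reference, a minimal sketch of that statistics refresh, scoped to just the two tables involved:)

-- Reclaim dead tuples and refresh planner statistics for both tables:
VACUUM ANALYZE replays_game;
VACUUM ANALYZE replays_playeringame;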
Interestingly, if I run this query for a player with a smaller number of games, Postgres chooses the index-scan + nested-loop plan. So something about the large number of games tickles this undesired behavior, where the relative estimated cost is out of line with the actual cost.
Finally, should I be using Postgres at all? I don't wish to become an expert in database tuning, so I'm looking for a database that will perform reasonably well with a conscientious developer's level of attention, as opposed to a dedicated DBA. I am afraid that if I stick with Postgres I will have a steady stream of issues like this that will force me to become a Postgres expert, and perhaps another DB will be more forgiving of a more casual approach.
A Postgres expert (RhodiumToad) reviewed my full database settings (http://pastebin.com/77QuiQSp) and recommended SET cpu_tuple_cost = 0.1. That gave a dramatic speedup: http://pastebin.com/nTHvSHVd
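(For reference, a sketch of trying that setting, first per session and then persisted per database; the database name mydb is a placeholder:)

-- Affects only the current session, good for experimenting:
SET cpu_tuple_cost = 0.1;

-- To persist the setting for one database ("mydb" is a placeholder):
ALTER DATABASE mydb SET cpu_tuple_cost = 0.1;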
Alternatively, switching to MySQL also solved the problem pretty nicely. I have default installations of MySQL and Postgres on my OS X box, and MySQL is 2x faster on queries that have been "warmed up" by repeated execution. On "cold" queries, i.e. the first time a given query is executed, MySQL is 5 to 150 times faster. The performance of cold queries is pretty important for my particular application.
The big question, as far as I'm concerned, is still outstanding -- will Postgres require more fiddling and configuration to run well than MySQL? For example, consider that none of the suggestions offered by the commenters here worked.
A hash join uses indexes only if the index supports the independent predicates (the single-table conditions, not the join condition itself). You can improve performance by reducing the hash table size, either horizontally (fewer rows) or vertically (fewer columns), as sketched below. Hash joins cannot perform joins whose join predicates contain range conditions (theta joins).
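An illustrative sketch of the vertical reduction, applied to this schema (the name column is hypothetical, used only to stand in for "the columns you actually need"):

-- Selecting only needed columns keeps the hashed rows narrow;
-- the slow plan above hashes full width=72 rows from replays_game.
SELECT g.id, g.name   -- "name" is a hypothetical column
FROM replays_game g
INNER JOIN replays_playeringame pig ON g.id = pig.game_id
WHERE pig.player_id = 50027;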
Nested loop joins are particularly efficient if the outer relation is small, because then the inner loop won't be executed too often. It is the typical join strategy used in OLTP workloads with a normalized data model, where it is highly efficient.
Hash join: the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next, the left relation is scanned, and the join attributes of each row found are used as hash keys to locate the matching rows in the table.
In Postgres, a B-tree index is what you most commonly want. Using an index can be much faster than a sequential scan because it may only have to read a few pages, as opposed to sequentially scanning thousands of them, when you're returning only a few records.
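A hedged sketch of applying that here: a composite index covering both the filter column and the join column can serve the lookup from the index alone (and, on PostgreSQL 9.2+, may enable index-only scans). The index name is made up:

-- Covers the WHERE player_id = ? filter and also yields game_id for the
-- join, often without visiting the heap (index-only scan on 9.2+).
CREATE INDEX ix_playeringame_player_game
    ON replays_playeringame (player_id, game_id);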
My guess is that you are using the default random_page_cost = 4, which is way too high, making index scans look too costly.
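A sketch of experimenting with a lower value, trying it per session before persisting anything (ALTER SYSTEM is only available on PostgreSQL 9.4+):

-- Current session only; re-run EXPLAIN ANALYZE to compare plans:
SET random_page_cost = 2;

-- To persist: set random_page_cost = 2 in postgresql.conf and reload,
-- or, on PostgreSQL 9.4+:
ALTER SYSTEM SET random_page_cost = 2;
SELECT pg_reload_conf();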
I tried to reconstruct the two tables with this script:
CREATE TABLE replays_game (
    id integer NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE replays_playeringame (
    player_id integer NOT NULL,
    game_id integer NOT NULL,
    PRIMARY KEY (player_id, game_id),
    CONSTRAINT replays_playeringame_game_fkey
        FOREIGN KEY (game_id) REFERENCES replays_game (id)
);

CREATE INDEX ix_replays_playeringame_game_id
    ON replays_playeringame (game_id);

-- 150k games
INSERT INTO replays_game
SELECT generate_series(1, 150000);

-- ~150k players, ~2 games each
INSERT INTO replays_playeringame
SELECT trunc(random() * 149999 + 1), generate_series(1, 150000);

INSERT INTO replays_playeringame
SELECT * FROM (
    SELECT trunc(random() * 149999 + 1) AS player_id,
           generate_series(1, 150000) AS game_id
) AS t
WHERE NOT EXISTS (
    SELECT 1 FROM replays_playeringame
    WHERE t.player_id = replays_playeringame.player_id
      AND t.game_id = replays_playeringame.game_id
);

-- the heavy player with 3000 games
INSERT INTO replays_playeringame
SELECT 999999, generate_series(1, 3000);
With the default value of 4:
game=# set random_page_cost = 4;
SET
game=# explain analyse SELECT "replays_game".* FROM "replays_game" INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id" WHERE "replays_playeringame"."player_id" = 999999;
                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1483.54..4802.54 rows=3000 width=4) (actual time=3.640..110.212 rows=3000 loops=1)
   Hash Cond: (replays_game.id = replays_playeringame.game_id)
   ->  Seq Scan on replays_game  (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.012..34.261 rows=150000 loops=1)
   ->  Hash  (cost=1446.04..1446.04 rows=3000 width=4) (actual time=3.598..3.598 rows=3000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 106kB
         ->  Bitmap Heap Scan on replays_playeringame  (cost=67.54..1446.04 rows=3000 width=4) (actual time=0.586..2.041 rows=3000 loops=1)
               Recheck Cond: (player_id = 999999)
               ->  Bitmap Index Scan on replays_playeringame_pkey  (cost=0.00..66.79 rows=3000 width=0) (actual time=0.560..0.560 rows=3000 loops=1)
                     Index Cond: (player_id = 999999)
 Total runtime: 110.621 ms
After lowering it to 2:
game=# set random_page_cost = 2;
SET
game=# explain analyse SELECT "replays_game".* FROM "replays_game" INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id" WHERE "replays_playeringame"."player_id" = 999999;
                                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=45.52..4444.86 rows=3000 width=4) (actual time=0.418..27.741 rows=3000 loops=1)
   ->  Bitmap Heap Scan on replays_playeringame  (cost=45.52..1424.02 rows=3000 width=4) (actual time=0.406..1.502 rows=3000 loops=1)
         Recheck Cond: (player_id = 999999)
         ->  Bitmap Index Scan on replays_playeringame_pkey  (cost=0.00..44.77 rows=3000 width=0) (actual time=0.388..0.388 rows=3000 loops=1)
               Index Cond: (player_id = 999999)
   ->  Index Scan using replays_game_pkey on replays_game  (cost=0.00..0.99 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=3000)
         Index Cond: (id = replays_playeringame.game_id)
 Total runtime: 28.542 ms
(8 rows)
If you are using an SSD, I would lower it further, to 1.1.
As for your last question, I really think you should stick with PostgreSQL. I have experience with PostgreSQL and MSSQL, and I needed to put in triple the effort into the latter for it to perform half as well as the former.
I ran sayap's testbed code (thanks!) with some modifications.
After this run, I did the same run scaled up tenfold: 1.5M records (with a heavy player of 30K games).
Currently, I am running the same test with a hundred-fold scale-up, but the initialisation is rather slow...
Results

The entries in the cells are the total time in msec plus a string that denotes the chosen query plan. (Only a handful of plans occur; see the legend below the tables.)
Original 3K / 150K, work_mem=16M

rpc  |      3K       |      1K5      |      750      |      375
-----+---------------+---------------+---------------+---------------
 8*  | 50.8 H.BBi.HS | 44.3 H.BBi.HS | 38.5 H.BBi.HS | 41.0 H.BBi.HS
 4   | 43.6 H.BBi.HS | 48.6 H.BBi.HS |  4.34 NBBi    |  1.33 NBBi
 2   |  6.92 NBBi    |  3.51 NBBi    |  4.61 NBBi    |  1.24 NBBi
 1   |  6.43 NII     |  3.49 NII     |  4.19 NII     |  1.18 NII

Original 3K / 150K, work_mem=64K

rpc  |      3K       |      1K5      |      750      |      375
-----+---------------+---------------+---------------+---------------
 8*  | 74.2 H.BBi.HS | 69.6 NBBi     | 62.4 H.BBi.HS | 66.9 H.BBi.HS
 4   |  6.67 NBBi    |  8.53 NBBi    |  1.91 NBBi    |  2.32 NBBi
 2   |  6.66 NBBi    |  3.6  NBBi    |  1.77 NBBi    |  0.93 NBBi
 1   |  7.81 NII     |  3.26 NII     |  1.67 NII     |  0.86 NII

Scaled 10*: 30K / 1M5, work_mem=16M

rpc  |      30K      |      15K      |      7k5      |      3k75
-----+---------------+---------------+---------------+---------------
 8*  |  623 H.BBi.HS |  556 H.BBi.HS |  531 H.BBi.HS | 14.9 NBBi
 4   | 56.4 M.I.sBBi | 54.3 NBBi     | 27.1 NBBi     | 19.1 NBBi
 2   | 71.0 NBBi     | 18.9 NBBi     |  9.7 NBBi     |  9.7 NBBi
 1   | 79.0 NII      | 35.7 NII      | 17.7 NII      |  9.3 NII

Scaled 10*: 30K / 1M5, work_mem=64K

rpc  |      30K      |      15K      |      7k5      |      3k75
-----+---------------+---------------+---------------+---------------
 8*  |  729 H.BBi.HS |  722 H.BBi.HS |  723 H.BBi.HS | 19.6 NBBi
 4   | 55.5 M.I.sBBi | 41.5 NBBi     | 19.3 NBBi     | 13.3 NBBi
 2   | 70.5 NBBi     | 41.0 NBBi     | 26.3 NBBi     | 10.7 NBBi
 1   | 69.7 NII      | 38.5 NII      | 20.0 NII      |  9.0 NII

Scaled 100*: 300K / 15M, work_mem=16M

rpc  |     300k      |     150K      |      75k      |     37k5
-----+---------------+---------------+---------------+---------------
 8*  | 7314 H.BBi.HS | 9422 H.BBi.HS | 6175 H.BBi.HS |  122 N.BBi.I
 4   |  569 M.I.sBBi |  199 M.I.sBBi |  142 M.I.sBBi |  105 N.BBi.I
 2   |  527 M.I.sBBi |  372 N.BBi.I  |  198 N.BBi.I  |  110 N.BBi.I
 1   |  694 NII      |  362 NII      |  190 NII      |  107 NII

Scaled 100*: 300K / 15M, work_mem=64K

rpc  |     300k       |     150k       |      75k       |     37k5
-----+----------------+----------------+----------------+----------------
 8*  | 22800 H.BBi.HS | 21920 H.BBi.HS | 20630 N.BBi.I  | 19669 H.BBi.HS
 4   | 22095 H.BBi.HS |   284 M.I.msBBi|   205 B.BBi.I  |   116 N.BBi.I
 2   |   528 M.I.msBBi|   399 N.BBi.I  |   211 N.BBi.I  |   110 N.BBi.I
 1   |   718 NII      |   364 NII      |   200 NII      |   105 NII

[8*] Note: the random_page_cost=8 runs were only intended as a prerun to prime the disk buffer cache; their results should be ignored.

Legend for node types:

 N  := Nested loop
 M  := Merge join
 H  := Hash (or Hash join)
 B  := Bitmap heap scan
 Bi := Bitmap index scan
 S  := Seq scan
 s  := sort
 m  := materialise
Preliminary conclusion:
"the working set" for the original query is too small: all of it fits in core, resulting in the cost of page fetches to be grossly overestimated. Setting RPC to 2 (or 1) "solves" this problem, but once the query is scaled-up, the page-costs become dominant, and RPC=4 becomes comparable or even better.
Setting work_mem to a lower value is another way to make the optimiser shift to index scans (instead of hash + bitmap scans). The differences I found are smaller than what sayap reported; maybe I have a larger effective_cache_size, or he forgot to prime the cache?
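For anyone who wants to reproduce a single cell of the grid, a minimal sketch against sayap's testbed (these values pick the rpc=2 / work_mem=64K cell for the heavy player):

-- One cell of the grid: random_page_cost = 2, work_mem = 64kB.
SET random_page_cost = 2;
SET work_mem = '64kB';

EXPLAIN ANALYZE
SELECT "replays_game".*
FROM "replays_game"
INNER JOIN "replays_playeringame"
        ON "replays_game"."id" = "replays_playeringame"."game_id"
WHERE "replays_playeringame"."player_id" = 999999;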