I have a simple query joining two tables that is really slow. I found out that the query plan does a seq scan on the large table email_activities (~10M rows), while I think a nested loop using the indexes would actually be faster.
I rewrote the query using a subquery in an attempt to force the use of the index, and then noticed something interesting. If you look at the two query plans below, you will see that when I limit the result set of the subquery to 43k, the plan does use the index on email_activities, while setting the limit in the subquery to even 44k causes the plan to use a seq scan on email_activities. One is clearly more efficient than the other, but Postgres doesn't seem to care.
What could cause this? Is there a config somewhere that forces the use of a hash join if one of the sets is larger than a certain size?
explain analyze
SELECT COUNT(DISTINCT "email_activities"."email_recipient_id")
FROM   "email_activities"
WHERE  email_recipient_id IN (
   SELECT "email_recipients"."id"
   FROM   email_recipients
   WHERE  "email_recipients"."email_campaign_id" = 1607
   LIMIT  43000);

QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=118261.50..118261.50 rows=1 width=4) (actual time=224.556..224.556 rows=1 loops=1)
   ->  Nested Loop  (cost=3699.03..118147.99 rows=227007 width=4) (actual time=32.586..209.076 rows=40789 loops=1)
         ->  HashAggregate  (cost=3698.94..3827.94 rows=43000 width=4) (actual time=32.572..47.276 rows=43000 loops=1)
               ->  Limit  (cost=0.09..3548.44 rows=43000 width=4) (actual time=0.017..22.547 rows=43000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.017..19.168 rows=43000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
         ->  Index Only Scan using index_email_activities_on_email_recipient_id on email_activities  (cost=0.09..2.64 rows=5 width=4) (actual time=0.003..0.003 rows=1 loops=43000)
               Index Cond: (email_recipient_id = email_recipients.id)
               Heap Fetches: 40789
 Total runtime: 224.675 ms
And:
explain analyze
SELECT COUNT(DISTINCT "email_activities"."email_recipient_id")
FROM   "email_activities"
WHERE  email_recipient_id IN (
   SELECT "email_recipients"."id"
   FROM   email_recipients
   WHERE  "email_recipients"."email_campaign_id" = 1607
   LIMIT  50000);

QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=119306.25..119306.25 rows=1 width=4) (actual time=3050.612..3050.613 rows=1 loops=1)
   ->  Hash Semi Join  (cost=4451.08..119174.27 rows=263962 width=4) (actual time=1831.673..3038.683 rows=47935 loops=1)
         Hash Cond: (email_activities.email_recipient_id = email_recipients.id)
         ->  Seq Scan on email_activities  (cost=0.00..107490.96 rows=9359988 width=4) (actual time=0.003..751.988 rows=9360039 loops=1)
         ->  Hash  (cost=4276.08..4276.08 rows=50000 width=4) (actual time=34.058..34.058 rows=50000 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 1758kB
               ->  Limit  (cost=0.09..4126.08 rows=50000 width=4) (actual time=0.016..27.302 rows=50000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.016..22.244 rows=50000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
 Total runtime: 3050.660 ms
As the two plans above show, even given an index to use, Postgres will not always choose to use it. The reason is that indexes have a cost: to create, to maintain (on writes), and to use (on reads).
If the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan, because an index scan requires several I/O operations per row (look up the row in the index, then fetch the row from the heap).
An index scan is faster than a table scan because it reads sorted data, and the optimizer knows when to stop and look for the next range. A very selective search criterion narrows the scan to a small range of the index (what some databases call an index seek), which is the fastest way to retrieve data.
If you have a SELECT * query but don't actually need all columns, you may be able to use an index-only scan just by changing the column list.
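For example, with the index on email_activities(email_recipient_id) from the plans above (the recipient id 42 is just a placeholder):

-- SELECT * needs columns that are not in the index, so every matching row
-- must also be fetched from the heap - at best a plain index scan:
EXPLAIN SELECT * FROM email_activities WHERE email_recipient_id = 42;

-- Selecting only the indexed column makes an index-only scan possible
-- (given enough pages are marked all-visible by VACUUM):
EXPLAIN SELECT email_recipient_id FROM email_activities WHERE email_recipient_id = 42;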
For few rows it pays to run an index scan. If enough data pages are visible to all transactions (= vacuumed enough, and not too much concurrent write load) and the index can provide all column values needed, then a faster index-only scan is used. With more rows expected to be returned (a higher percentage of the table - and depending on data distribution, value frequencies and row width) it becomes more likely that several rows are found on one data page. Then it pays to switch to a bitmap index scan. (Or to combine multiple distinct indexes.) Once a large percentage of data pages has to be visited anyway, it's cheaper to run a sequential scan, filter surplus rows and skip the overhead for indexes altogether.
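To see that progression, compare plans for predicates of different selectivity. A sketch (the commented plan shape is illustrative; actual output depends on your data and settings):

-- A moderately selective predicate often produces a bitmap scan, which sorts
-- the matching heap pages and visits each page only once:
EXPLAIN SELECT id FROM email_recipients WHERE email_campaign_id = 1607;
--  Bitmap Heap Scan on email_recipients
--    Recheck Cond: (email_campaign_id = 1607)
--    ->  Bitmap Index Scan on index_email_recipients_on_email_campaign_id
--          Index Cond: (email_campaign_id = 1607)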
Index usage becomes (much) cheaper and more likely when accessing data pages in random order is not (much) more expensive than accessing them in sequential order. That's the case when using SSDs instead of spinning disks, and even more so the more is cached in RAM - provided the respective configuration parameters random_page_cost and effective_cache_size are set accordingly.
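For illustration, a setup on SSDs might lower the cost of random access like this (the values are assumptions to tune for your hardware, not recommendations):

-- random_page_cost defaults to 4.0, which models spinning disks; values close
-- to seq_page_cost (1.0) tell the planner random reads are almost as cheap:
SET random_page_cost = 1.1;

-- effective_cache_size is only a planner hint about how much data the OS cache
-- and shared_buffers can hold together; it does not allocate any memory:
SET effective_cache_size = '8GB';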
In your case, Postgres switches to a sequential scan expecting to find rows=263962 - already ~3% of the whole table (263,962 of ~9.36M rows). (While only rows=47935 are actually found, see below.)
You cannot force a certain planner method directly in Postgres, but you can make other methods seem extremely expensive for debugging purposes. See Planner Method Configuration in the manual.
SET enable_seqscan = off (as suggested in another answer) does that to sequential scans. But it's intended for debugging purposes in your session only (see the sketch below). Do not use it as a general setting in production unless you know exactly what you are doing. It can force ridiculous query plans. The manual:
These configuration parameters provide a crude method of influencing the query plans chosen by the query optimizer. If the default plan chosen by the optimizer for a particular query is not optimal, a temporary solution is to use one of these configuration parameters to force the optimizer to choose a different plan. Better ways to improve the quality of the plans chosen by the optimizer include adjusting the planner cost constants (see Section 19.7.2), running ANALYZE manually, increasing the value of the default_statistics_target configuration parameter, and increasing the amount of statistics collected for specific columns using ALTER TABLE SET STATISTICS.
That's already most of the advice you need.
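For completeness, a debugging session comparing the plan without sequential scans might look like this (a sketch using the query from the question; SET LOCAL confines the change to the current transaction):

BEGIN;
SET LOCAL enable_seqscan = off;   -- gone again after COMMIT / ROLLBACK

EXPLAIN ANALYZE
SELECT COUNT(DISTINCT email_recipient_id)
FROM   email_activities
WHERE  email_recipient_id IN (
   SELECT id
   FROM   email_recipients
   WHERE  email_campaign_id = 1607
   LIMIT  50000);

ROLLBACK;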
In this particular case, Postgres expects 5-6 times more hits on email_activities.email_recipient_id than are actually found:

estimated rows=227007 vs. actual rows=40789 (~5.6x)
estimated rows=263962 vs. actual rows=47935 (~5.5x)
If you run this query often, it will pay to have ANALYZE look at a bigger sample for more accurate statistics on the particular column. Your table is big (~10M rows), so make that:
ALTER TABLE email_activities ALTER COLUMN email_recipient_id SET STATISTICS 3000; -- max 10000, default 100
Then run:

ANALYZE email_activities;
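To inspect what ANALYZE collected for the column afterwards, you can query the pg_stats view (a sketch):

SELECT null_frac, n_distinct, correlation
FROM   pg_stats
WHERE  tablename = 'email_activities'
AND    attname   = 'email_recipient_id';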
In very rare cases you might resort to forcing an index with SET LOCAL enable_seqscan = off in a separate transaction, or in a function with its own environment. Like:
CREATE OR REPLACE FUNCTION f_count_dist_recipients(_email_campaign_id int, _limit int)
  RETURNS bigint AS
$func$
   SELECT COUNT(DISTINCT a.email_recipient_id)
   FROM   email_activities a
   WHERE  a.email_recipient_id IN (
      SELECT id
      FROM   email_recipients
      WHERE  email_campaign_id = $1
      LIMIT  $2)          -- or consider the query below
$func$  LANGUAGE sql VOLATILE COST 100000 SET enable_seqscan = off;
The setting only applies to the local scope of the function.
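A call then looks like:

SELECT f_count_dist_recipients(1607, 43000);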
Warning: This is just a proof of concept. Even this much less radical manual intervention might bite you in the long run. Cardinalities, value frequencies, your schema, global Postgres settings - everything changes over time. You are going to upgrade to a new Postgres version. The query plan you force now may become a very bad idea later.
And typically this is just a workaround for a problem with your setup. Better find and fix it.
Essential information is missing in the question, but this equivalent query is probably faster and more likely to use an index on (email_recipient_id) - increasingly so for a bigger LIMIT:
SELECT COUNT(*) AS ct
FROM  (
   SELECT id
   FROM   email_recipients
   WHERE  email_campaign_id = 1607
   LIMIT  43000
   ) r
WHERE  EXISTS (
   SELECT FROM email_activities
   WHERE  email_recipient_id = r.id);
A sequential scan can be more efficient, even when an index exists. In this case, Postgres seems to estimate things rather wrongly. An ANALYZE <table> on all related tables can help in such cases. If it doesn't, you can set the variable enable_seqscan to off to force Postgres to use an index whenever technically possible - at the expense that sometimes an index scan will be used when a sequential scan would perform better.