Take the following two tables:
Table "public.contacts"
Column | Type | Modifiers | Storage | Stats target | Description
--------------------+-----------------------------+-------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('contacts_id_seq'::regclass) | plain | |
created_at | timestamp without time zone | not null | plain | |
updated_at | timestamp without time zone | not null | plain | |
external_id | integer | | plain | |
email_address | character varying | | extended | |
first_name | character varying | | extended | |
last_name | character varying | | extended | |
company | character varying | | extended | |
industry | character varying | | extended | |
country | character varying | | extended | |
region | character varying | | extended | |
ext_instance_id | integer | | plain | |
title | character varying | | extended | |
Indexes:
"contacts_pkey" PRIMARY KEY, btree (id)
"index_contacts_on_ext_instance_id_and_external_id" UNIQUE, btree (ext_instance_id, external_id)
and
Table "public.members"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------+-----------------------------+--------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('members_id_seq'::regclass) | plain | |
step_id | integer | | plain | |
contact_id | integer | | plain | |
rule_id | integer | | plain | |
request_id | integer | | plain | |
sync_id | integer | | plain | |
status | integer | not null default 0 | plain | |
matched_targeted_rule | boolean | default false | plain | |
external_fields | jsonb | | extended | |
imported_at | timestamp without time zone | | plain | |
campaign_id | integer | | plain | |
ext_instance_id | integer | | plain | |
created_at | timestamp without time zone | | plain | |
Indexes:
"members_pkey" PRIMARY KEY, btree (id)
"index_members_on_contact_id_and_step_id" UNIQUE, btree (contact_id, step_id)
"index_members_on_campaign_id" btree (campaign_id)
"index_members_on_step_id" btree (step_id)
"index_members_on_sync_id" btree (sync_id)
"index_members_on_request_id" btree (request_id)
"index_members_on_status" btree (status)
Indexes exist on both primary keys and on members.contact_id.
I need to delete any contact which has no related members. There are roughly 3MM contact and 25MM member records.
I'm attempting the following two queries:
DELETE FROM "contacts"
WHERE "contacts"."id" IN (SELECT "contacts"."id"
FROM "contacts"
LEFT OUTER JOIN members
ON
members.contact_id = contacts.id
WHERE members.id IS NULL);
DELETE 0
Time: 173033.801 ms
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.354..188717.354 rows=0 loops=1)
-> Nested Loop (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.351..188717.351 rows=0 loops=1)
-> HashAggregate (cost=2654306.36..2654306.37 rows=1 width=16) (actual time=188717.349..188717.349 rows=0 loops=1)
Group Key: contacts_1.id
-> Hash Right Join (cost=161177.46..2654306.36 rows=1 width=16) (actual time=188717.345..188717.345 rows=0 loops=1)
Hash Cond: (members.contact_id = contacts_1.id)
Filter: (members.id IS NULL)
Rows Removed by Filter: 26725870
-> Seq Scan on members (cost=0.00..1818698.96 rows=25322396 width=14) (actual time=0.043..160226.686 rows=26725870 loops=1)
-> Hash (cost=105460.65..105460.65 rows=3205265 width=10) (actual time=1962.612..1962.612 rows=3196180 loops=1)
Buckets: 262144 Batches: 4 Memory Usage: 34361kB
-> Seq Scan on contacts contacts_1 (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.011..950.657 rows=3196180 loops=1)
-> Index Scan using contacts_pkey on contacts (cost=0.43..1.48 rows=1 width=10) (never executed)
Index Cond: (id = contacts_1.id)
Planning time: 0.488 ms
Execution time: 188718.862 ms
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
DELETE 0
Time: 170871.219 ms
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.034..177523.034 rows=0 loops=1)
-> Hash Anti Join (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.029..177523.029 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.018..1068.357 rows=3196180 loops=1)
-> Hash (cost=1818698.96..1818698.96 rows=25322396 width=10) (actual time=169587.802..169587.802 rows=26725870 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 36228kB
-> Seq Scan on members c (cost=0.00..1818698.96 rows=25322396 width=10) (actual time=0.052..160081.880 rows=26725870 loops=1)
Planning time: 0.901 ms
Execution time: 177524.526 ms
As you can see, even without deleting any records, both queries show similar performance, taking ~3 minutes. The server's disk I/O spikes to 100%, so I'm assuming data is being spilled to disk because a sequential scan is done on both contacts and members.
The server is an EC2 r3.large (15GB RAM).
Any ideas on what I can do to optimize this query?
After running vacuum analyze on both tables and ensuring enable_mergejoin is set to on, there is no difference in the query time:
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.342..209406.342 rows=0 loops=1)
-> Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105683.28 rows=3227528 width=10) (actual time=0.008..1010.643 rows=3227462 loops=1)
-> Hash (cost=1814029.74..1814029.74 rows=24855474 width=10) (actual time=198054.302..198054.302 rows=27307060 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 37006kB
-> Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Planning time: 0.328 ms
Execution time: 209408.040 ms
PG Version:
PostgreSQL 9.4.4 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (Gentoo Hardened 4.5.4 p1.0, pie-0.4.7) 4.5.4, 64-bit
Relation size:
Table | Size | External Size
-----------------------+---------+---------------
members | 23 GB | 11 GB
contacts | 944 MB | 371 MB
Settings:
     Setting      | Value
------------------+-------
 work_mem         | 64MB
 random_page_cost | 4
Experimenting with doing this in batches doesn't seem to help with the I/O usage (it still spikes to 100%) and doesn't improve the overall time, despite using index-based plans.
DO $do$
BEGIN
  FOR i IN 57..668 LOOP
    DELETE FROM contacts
    WHERE contacts.id IN
          (SELECT contacts.id
           FROM contacts
           LEFT OUTER JOIN members
             ON members.contact_id = contacts.id
           WHERE members.id IS NULL
             AND contacts.id >= (i * 10000)
             AND contacts.id < ((i + 1) * 10000));
  END LOOP;
END $do$;
I had to kill the query after Time: 1203492.326 ms, and disk I/O stayed at 100% the entire time the query ran. I also experimented with 1,000 and 5,000 row chunks but did not see any increase in performance.
Note: The 57..668 range was used because I know it covers the existing contact IDs (based on min(id) and max(id)).
One approach to problems like this can be to do it in smaller chunks.
DELETE FROM "contacts"
WHERE "contacts"."id" IN (
SELECT id
FROM contacts
LEFT OUTER JOIN members ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND id >= 1 AND id < 1000
);
DELETE FROM "contacts"
WHERE "contacts"."id" IN (
SELECT id
FROM contacts
LEFT OUTER JOIN members ON members.contact_id = contacts.id
WHERE members.id IS NULL
AND id >= 1001 AND id < 2000
);
Rinse, repeat. Experiment with different chunk sizes to find an optimal one for your data set: one that uses the fewest queries while keeping each of them comfortably in memory.
Naturally, you would want to script this, possibly in plpgsql, or in whatever scripting language you prefer.
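For example, a minimal plpgsql sketch of that loop - the chunk size and variable names are illustrative, and note that on 9.4 the whole DO block runs in a single transaction, so per-chunk commits would need a client-side driver instead:
DO $do$
DECLARE
  chunk integer := 10000;  -- illustrative chunk size
  lo    integer;
  hi    integer;
BEGIN
  SELECT min(id), max(id) INTO lo, hi FROM contacts;
  WHILE lo <= hi LOOP
    DELETE FROM contacts
    WHERE contacts.id IN
          (SELECT contacts.id
           FROM contacts
           LEFT OUTER JOIN members ON members.contact_id = contacts.id
           WHERE members.id IS NULL
             AND contacts.id >= lo
             AND contacts.id <  lo + chunk);
    lo := lo + chunk;
  END LOOP;
END
$do$;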
Update the statistics used by the planner and set enable_mergejoin to on:
vacuum analyse members;
vacuum analyse contacts;
set enable_mergejoin to on;
You should get a query plan similar to this one:
explain analyse
delete from contacts
where not exists (
select 1
from members c
where c.contact_id = contacts.id);
QUERY PLAN
----------------------------------------------------------------------
Delete on contacts
-> Merge Anti Join
Merge Cond: (contacts.id = c.contact_id)
-> Index Scan using contacts_pkey on contacts
-> Index Scan using members_contact_id_idx on members c
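Note: the plan above shows an index scan on members_contact_id_idx, a single-column index on members.contact_id that is not in the posted schema. Your existing index_members_on_contact_id_and_step_id has contact_id as its leading column and can usually be used the same way, but if the planner does not pick it up you could create the dedicated index:
-- Only needed if the planner will not use the existing composite index.
CREATE INDEX members_contact_id_idx ON members (contact_id);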
Any ideas on what I can do to optimize this query?
Your queries are perfect. I would use the NOT EXISTS variant.
Your index index_members_on_contact_id_and_step_id is also good for it, since contact_id is its leading column. But see below about BRIN indexes.
You can tune your server, table and index configuration.
Since you do not actually update or delete many rows (hardly any at all, according to your comment?), you need to optimize read performance.
You provided:
The server is an EC2 r3.large (15GB RAM).
And:
PostgreSQL 9.4.4
Your version is seriously outdated. At least upgrade to the latest minor version. Better yet, upgrade to the current major version: Postgres 9.5 and 9.6 brought major improvements for big data, which is exactly what you need.
Consider the versioning policy of the project.
Amazon allows you to upgrade!
There is an unexpected 10% mismatch between expected and actual row count in the basic sequential scan:
Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Not dramatic at all, but it still should not occur in this query. It indicates that you might have to tune your autovacuum settings, possibly per table for the very big ones.
More problematic:
Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Postgres expects to find 1875003 rows to delete, while actually 0 rows are found. That's unexpected. Substantially increasing the statistics target on members.contact_id and contacts.id may help to decrease the gap, which might allow better query plans.
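For example (1000 is an arbitrary value to experiment with; the default target is 100):
ALTER TABLE members  ALTER COLUMN contact_id SET STATISTICS 1000;
ALTER TABLE contacts ALTER COLUMN id         SET STATISTICS 1000;
ANALYZE members;
ANALYZE contacts;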
Your ~ 25MM rows in members occupy 23 GB - that's almost 1kb per row, which seems excessive for the table definition you presented (even if the total size you provided should include indexes):
 4 bytes   item identifier
24 bytes   tuple header
 8 bytes   null bitmap
36 bytes   9x integer
16 bytes   2x timestamp
 1 byte    1x boolean
??         1x jsonb
That's 89 bytes per row - or less with some NULL values - and hardly any alignment padding, so 96 bytes max, plus your jsonb column.
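To see how much of that the jsonb column accounts for, you could measure average stored sizes directly - a rough check only, since pg_column_size reports possibly-compressed sizes:
-- Note: this scans the whole 23 GB table, so it will take a while.
SELECT avg(pg_column_size(external_fields)) AS avg_jsonb_bytes
     , avg(pg_column_size(m.*))             AS avg_row_bytes
FROM   members m;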
Either that jsonb column is very big, in which case I would suggest normalizing the data into separate columns or a separate table.
Or your table is bloated, which can be solved with VACUUM FULL ANALYZE or, while you are at it:
CLUSTER members USING index_members_on_contact_id_and_step_id;
VACUUM members;
But either takes an exclusive lock on the table, which you say you cannot afford. pg_repack can do it without an exclusive lock.
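If you go the pg_repack route, the invocation might look roughly like this (mydb is a placeholder; pg_repack must be installed both as a client binary and as an extension in the database):
pg_repack --table=members --order-by=contact_id,step_id mydb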
Even if we factor in index sizes, your table seems too big: you have 7 small indexes, each 36 - 44 bytes per row without bloat, less with NULL values, so < 300 bytes altogether.
Either way, consider more aggressive autovacuum settings for your table members.
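Per-table autovacuum settings are storage parameters on the table; a hedged example with illustrative values (not tuned recommendations):
-- Example values only; tune to your write rate.
ALTER TABLE members SET (
    autovacuum_vacuum_scale_factor  = 0.01
  , autovacuum_analyze_scale_factor = 0.01
);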
And / or stop bloating the table to begin with. Are you updating rows a lot? Any particular column you update a lot? That jsonb column, maybe? You might move it to a separate (1:1) table, just to stop the frequent updates from bloating the main table with dead tuples and keeping autovacuum from doing its job.
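A sketch of such a split - the table name member_external_fields is made up for illustration, and dropping the old column does not reclaim space until the table is rewritten or vacuumed:
-- Hypothetical 1:1 side table for the frequently updated jsonb payload.
CREATE TABLE member_external_fields (
    member_id       integer PRIMARY KEY REFERENCES members (id)
  , external_fields jsonb
);
-- Copy existing data over, then drop the old column.
INSERT INTO member_external_fields (member_id, external_fields)
SELECT id, external_fields
FROM   members
WHERE  external_fields IS NOT NULL;
ALTER TABLE members DROP COLUMN external_fields;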
Block range indexes require Postgres 9.5 or later and dramatically reduce index size. I was too optimistic in my first draft. A BRIN index is perfect for your use case if you have many rows in members for each contact_id - after physically clustering your table at least once (see the CLUSTER command above). In that case Postgres can rule out whole data pages quickly. But your numbers indicate only around 8 rows per contact_id, so data pages would often contain multiple values, which voids much of the effect. It depends on the actual details of your data distribution ...
On the other hand, as it stands, your tuple size is around 1 kb, so only ~ 8 rows per data page (typically 8kb). If that isn't mostly bloat, a BRIN index might help after all.
But you need to upgrade your server version first (see the note on your Postgres version above).
CREATE INDEX members_contact_id_brin_idx ON members USING BRIN (contact_id);