I need to delete the majority (say, 90%) of a very large table (say, 5m rows). The other 10% of this table is frequently read, but not written to.
From "Best way to delete millions of rows by ID", I gather that I should remove any index on the 90% I'm deleting, to speed up the process (except an index I'm using to select the rows for deletion).
From "PostgreSQL locking mode", I see that this operation will acquire a ROW EXCLUSIVE lock on the entire table. But since I'm only reading the other 10%, this ought not matter.
So, is it safe to delete everything in one command (i.e. DELETE FROM table WHERE delete_flag='t')? I'm worried that if the deletion of one row fails, triggering an enormous rollback, then it will affect my ability to read from the table. Would it be wiser to delete in batches?
Indexes are typically useless for operations on 90% of all rows. Sequential scans will be faster either way. (Exotic exceptions apply.)
If you need to allow concurrent reads, you cannot take an exclusive lock on the table. So you also cannot drop any indexes in the same transaction.
You could drop indexes in separate transactions to keep the duration of the exclusive lock at a minimum. In Postgres 9.2 or later you can also use DROP INDEX CONCURRENTLY, which only needs minimal locks. Later use CREATE INDEX CONCURRENTLY to rebuild the index in the background - and only take a very brief exclusive lock.
If you have a stable condition to identify the 10 % (or less) of rows that stay, I would suggest a partial index on just those rows to get the best for both:
DELETE is not going to modify the partial index at all, since none of the rows are involved in the DELETE.CREATE INDEX foo (some_id) WHERE delete_flag = FALSE;
Assuming delete_flag is boolean. You have to include the same predicate in your queries (even if it seems logically redundant) to make sure Postgres can the partial index.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With