Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to delete large no of random rows in PostgreSQL

Tags:

postgresql

I have a table which contains about 900K rows.I want to delete about 90% of the rows. Tried using TABLESAMPLE to select them randomly but didn't get much performance improvement. Here are the queries which i have tried and there times

sql> DELETE FROM users WHERE id IN (
SELECT id FROM users ORDER BY random() LIMIT 5000
)
[2017-11-22 11:35:39] 5000 rows affected in 1m 11s 55ms


sql> DELETE FROM users WHERE id IN (
SELECT id FROM users TABLESAMPLE BERNOULLI (5)
)
[2017-11-22 11:55:07] 5845 rows affected in 1m 13s 666ms


sql> DELETE FROM users WHERE id IN (
SELECT id FROM users TABLESAMPLE SYSTEM (5)
)
[2017-11-22 11:57:59] 5486 rows affected in 1m 4s 574ms

Only deleting 5% data takes about an min. So this is going to take very long for large data. Pls suggest if I am doing things right or if there is any better way to do this.

like image 327
Pooja Avatar asked Dec 23 '22 12:12

Pooja


2 Answers

Deleting a large number of rows is always going to be slow. The way how you identify them won't make much difference.

Instead of deleting a large number it's usually a lot faster, to create a new table that contains those rows that you want to keep, e.g.:

create table users_to_keep
as
select *
from users
tablesample system (10);

then truncate the original table and insert the rows that you stored away:

truncate table users;
insert into users
select *
from users_to_keep;

If you want, you can do that in a single transaction.

like image 54
a_horse_with_no_name Avatar answered Jan 10 '23 08:01

a_horse_with_no_name


As a_horse_with_no_name pointed out, the random selection itself is a relatively minor factor. And much of the cost associated with a deletion (e.g. foreign key checks) is not something you can avoid.

The only thing which stands out as an unnecessary overhead is the id-based lookup in the DELETE statement; you just visited the row during the random selection step, and now you're looking it up again, presumably via an index on id.

Instead, you can perform the lookup using the row's physical location, represented by the hidden ctid column:

DELETE FROM users WHERE ctid = ANY(ARRAY(
  SELECT ctid FROM users TABLESAMPLE SYSTEM (5)
))

This gave me a ~6x speedup in an artificial test, though it will likely be dwarfed by other costs in most real-world scenarios.

like image 32
Nick Barnes Avatar answered Jan 10 '23 08:01

Nick Barnes