Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to delete millions of rows by ID

I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, any way I try to do this is taking days.

I tried putting them in a table and doing it in batches of 100. 4 days later, this is still running with only 297268 rows deleted. (I had to select 100 id's from an ID table, delete where IN that list, delete from ids table the 100 I selected).

I tried:

DELETE FROM tbl WHERE id IN (select * from ids) 

That's taking forever, too. Hard to gauge how long, since I can't see it's progress till done, but the query was still running after 2 days.

Just kind of looking for the most effective way to delete from a table when I know the specific ID's to delete, and there are millions of IDs.

like image 210
Anthony Greco Avatar asked Nov 28 '11 02:11

Anthony Greco


People also ask

How do I delete a large number of rows?

Press F5 or Ctrl+G to activate the Go To dialog. Enter 501:10000 in the Reference box, then press Enter or click OK. Either right-click anywhere in the selected range and select Delete from the context menu, or click the upper half of the Delete button in the Cells group of the Home tab of the ribbon.

How do you delete 10000 records in SQL?

If you need to remove 10 million rows and have 1 GB of log space available use Delete TOP(10000) From dbo. myTable (with your select clause) and keep running it till there are no more rows to delete.


2 Answers

It all depends ...

  • Assuming no concurrent write access to involved tables or you may have to lock tables exclusively or this route may not be for you at all.

  • Delete all indexes (possibly except the ones needed for the delete itself).
    Recreate them afterwards. That's typically much faster than incremental updates to indexes.

  • Check if you have triggers that can safely be deleted / disabled temporarily.

  • Do foreign keys reference your table? Can they be deleted? Temporarily deleted?

  • Depending on your autovacuum settings it may help to run VACUUM ANALYZE before the operation.

  • Some of the points listed in the related chapter of the manual Populating a Database may also be of use, depending on your setup.

  • If you delete large portions of the table and the rest fits into RAM, the fastest and easiest way may be this:

BEGIN; -- typically faster and safer wrapped in a single transaction  SET LOCAL temp_buffers = '1000MB'; -- enough to hold the temp table  CREATE TEMP TABLE tmp AS SELECT t.* FROM   tbl t LEFT   JOIN del_list d USING (id) WHERE  d.id IS NULL;      -- copy surviving rows into temporary table -- ORDER BY ?             -- optionally order favorably while being at it  TRUNCATE tbl;             -- empty table - truncate is very fast for big tables  INSERT INTO tbl TABLE tmp;        -- insert back surviving rows.  COMMIT; 

This way you don't have to recreate views, foreign keys or other depending objects. And you get a pristine (sorted) table without bloat.

Read about the temp_buffers setting in the manual. This method is fast as long as the table fits into memory, or at least most of it. The transaction wrapper defends against losing data if your server crashes in the middle of this operation.

Run VACUUM ANALYZE afterwards. Or (typically not necessary after going the TRUNCATE route) VACUUM FULL ANALYZE to bring it to minimum size (takes exclusive lock). For big tables consider the alternatives CLUSTER / pg_repack or similar:

  • Optimize Postgres query on timestamp range

For small tables, a simple DELETE instead of TRUNCATE is often faster:

DELETE FROM tbl t USING  del_list d WHERE  t.id = d.id; 

Read the Notes section for TRUNCATE in the manual. In particular (as Pedro also pointed out in his comment):

TRUNCATE cannot be used on a table that has foreign-key references from other tables, unless all such tables are also truncated in the same command. [...]

And:

TRUNCATE will not fire any ON DELETE triggers that might exist for the tables.

like image 127
Erwin Brandstetter Avatar answered Oct 05 '22 18:10

Erwin Brandstetter


I just hit this issue myself and for me the, by far, fastest method was using WITH Queries in combination with USING

Basically the WITH-query creates a temporary table with the primary keys to delete in the table you want to delete from.

WITH to_delete AS (    SELECT item_id FROM other_table WHERE condition_x = true ) DELETE FROM table  USING to_delete  WHERE table.item_id = to_delete.item_id    AND NOT to_delete.item_id IS NULL; 

Ofcourse the SELECT inside the WITH-query can be as complex as any other select with multiple joins etc. It just has to return one or more columns that are used to identify the items in the target table that need to be deleted.

NOTE: AND NOT to_delete.item_id IS NULL most likely is not necessary, but I didn't dare to try.

Other things to consider are

  1. creating indexes on other tables referring to this one via foreign key. Which can reduce a delete taking hours to mere seconds in certain situations
  2. deferring constraint checks: It's not clear how much, if any improvement this achieves, but according to this it can increase performance. Downside is, if you have a foreign key violation you will learn it only at the very last moment.
  3. DANGEROUS but big possible boost: disable constaint checks and triggers during the delete
like image 35
Torge Avatar answered Oct 05 '22 16:10

Torge