Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to efficiently delete rows from a Postgresql 8.1 table?

I'm working on a PostgreSQL 8.1 SQL script which needs to delete a large number of rows from a table.

Let's say the table I need to delete from is Employees (~260K rows). It has primary key named id.

The rows I need to delete from this table are stored in a separate temporary table called EmployeesToDelete (~10K records) with a foreign key reference to Employees.id called employee_id.

Is there an efficient way to do this?

At first, I thought of the following:

DELETE
FROM    Employees
WHERE   id IN
        (
        SELECT  employee_id
        FROM    EmployeesToDelete
        )

But I heard that using the "IN" clause and subqueries can be inefficient, especially with larger tables.

I've looked at the PostgreSQL 8.1 documentation, and there's mention of DELETE FROM ... USING but it doesn't have examples so I'm not sure how to use it.

I'm wondering if the following works and is more efficient?

DELETE
FROM    Employees
USING   Employees e
INNER JOIN
        EmployeesToDelete ed
ON      e.id = ed.employee_id

Your comments are greatly appreciated.

Edit: I ran EXPLAIN ANALYZE and the weird thing is that the first DELETE ran pretty quickly (within seconds), while the second DELETE took so long (over 20 min) I eventually cancelled it.

Adding an index to the temp table helped the performance quite a bit.

Here's a query plan of the first DELETE for anyone interested:

 Hash Join  (cost=184.64..7854.69 rows=256482 width=6) (actual time=54.089..660.788 rows=27295 loops=1)
   Hash Cond: ("outer".id = "inner".employee_id)
   ->  Seq Scan on Employees  (cost=0.00..3822.82 rows=256482 width=10) (actual time=15.218..351.978 rows=256482 loops=1)
   ->  Hash  (cost=184.14..184.14 rows=200 width=4) (actual time=38.807..38.807 rows=10731 loops=1)
         ->  HashAggregate  (cost=182.14..184.14 rows=200 width=4) (actual time=19.801..28.773 rows=10731 loops=1)
               ->  Seq Scan on EmployeesToDelete  (cost=0.00..155.31 rows=10731 width=4) (actual time=0.005..9.062 rows=10731 loops=1)

 Total runtime: 935.316 ms
(7 rows)

At this point, I'll stick with the first DELETE unless I can find a better way of writing it.

like image 776
Jin Kim Avatar asked Apr 22 '09 15:04

Jin Kim


People also ask

How do I delete multiple rows in PostgreSQL?

First, specify the table from which you want to delete data in the DELETE FROM clause. Second, specify which rows to delete by using the condition in the WHERE clause. The WHERE clause is optional. However, if you omit it, the DELETE statement will delete all rows in the table.

Which SQL command can be used to delete all records from the table in PostgreSQL?

TRUNCATE provides a faster mechanism to remove all rows from a table.


2 Answers

Don't guess, measure. Try the various methods and see which one is the shortest to execute. Also, use EXPLAIN to know what PostgreSQL will do and see where you can optimize. Very few PostgreSQL users are able to guess correctly the fastest query...

like image 129
bortzmeyer Avatar answered Oct 27 '22 00:10

bortzmeyer


I'm wondering if the following works and is more efficient?

    DELETE
    FROM    Employees e
    USING   EmployeesToDelete ed
    WHERE   id = ed.employee_id;

This totally depend on your index selectivity.

PostgreSQL tends to employ MERGE IN JOIN for IN predicates, which has stable execution time.

It's not affected by how many rows satisfy this condition, provided that you already have an ordered resultset.

An ordered resultset requires either a sort operation or an index. Full index traversal is very inefficient in PostgreSQL compared to SEQ SCAN.

The JOIN predicate, on the other hand, may benefit from using NESTED LOOPS if your index is very selective, and from using HASH JOIN is it's inselective.

PostgreSQL should select the right one by estimating the row count.

Since you have 30k rows against 260K rows, I expect HASH JOIN to be more efficient, and you should try to build a plan on a DELETE ... USING query.

To make sure, please post execution plan for both queries.

like image 33
Quassnoi Avatar answered Oct 27 '22 01:10

Quassnoi