Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Delete query on a very large table running extremely slowly SQL

Tags:

sql

postgresql

I have this SQL query:

delete from scans
    where scandatetime>(current_timestamp - interval '21 days') and
          scandatetime <> (select min(tt.scandatetime) from scans tt where tt.imb = scans.imb) and
          scandatetime <> (select max(tt.scandatetime) from scans tt where tt.imb = scans.imb)
;

That I use to delete records from the following table:

|imb        |scandatetime       |status   |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake   |12345          |
|isdijh23452|2020-01-01 13:45:12|Intake   |12345          |
|isdijh23452|2020-01-01 19:30:32|Received |12345          |
|isdijh23452|2020-01-02 04:50:22|Confirmed|12345          |
|isdijh23452|2020-01-03 19:32:18|Processed|45867          |
|awgjnh09864|2020-01-01 10:24:16|Intake   |84676          |
|awgjnh09864|2020-01-01 19:30:32|Received |84676          |
|awgjnh09864|2020-01-01 19:30:32|Received |84676          |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676          |

such that only 2 records remain per IMB, the one with the minimum scandatetime and the maximum scandatetime. I also limit this so it only performs this operation for records that are less than 3 weeks old. The resultant table looks like this:

|imb        |scandatetime       |status   |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake   |12345          |
|isdijh23452|2020-01-03 19:32:18|Processed|45867          |
|awgjnh09864|2020-01-01 10:24:16|Intake   |84676          |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676          |

This table has a few indexes and has tens of millions of rows, so the query usually takes forever to run. How can I speed this up?

Explain output:

Delete on scans  (cost=0.57..115934571.45 rows=10015402 width=6)
  ->  Index Scan using scans_staging_scandatetime_idx on scans  (cost=0.57..115934571.45 rows=10015402 width=6)
        Index Cond: (scandatetime > (CURRENT_TIMESTAMP - '21 days'::interval))
        Filter: ((scandatetime <> (SubPlan 2)) AND (scandatetime <> (SubPlan 4)))
        SubPlan 2
          ->  Result  (cost=3.91..3.92 rows=1 width=8)
                InitPlan 1 (returns $1)
                  ->  Limit  (cost=0.70..3.91 rows=1 width=8)
                        ->  Index Only Scan using scans_staging_imb_scandatetime_idx on scans tt  (cost=0.70..16.79 rows=5 width=8)
                              Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
        SubPlan 4
          ->  Result  (cost=3.91..3.92 rows=1 width=8)
                InitPlan 3 (returns $3)
                  ->  Limit  (cost=0.70..3.91 rows=1 width=8)
                        ->  Index Only Scan Backward using scans_staging_imb_scandatetime_idx on scans tt_1  (cost=0.70..16.79 rows=5 width=8)
                              Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))

Table DDL:

-- Table Definition ----------------------------------------------

CREATE TABLE scans (
    imb text,
    scandatetime timestamp without time zone,
    status text,
    scanfacilityzip text
);

-- Indices -------------------------------------------------------

CREATE INDEX scans_staging_scandatetime_idx ON scans(scandatetime timestamp_ops);
CREATE INDEX scans_staging_imb_idx ON scans(imb text_ops);
CREATE INDEX scans_staging_status_idx ON scans(status text_ops);
CREATE INDEX scans_staging_scandatetime_status_idx ON scans(scandatetime timestamp_ops,status text_ops);
CREATE INDEX scans_staging_imb_scandatetime_idx ON scans(imb text_ops,scandatetime timestamp_ops);

Edit: Here is the explain analyze output (note, I changed the interval to 1 day to make it run faster):

Delete on scans  (cost=0.58..3325615.74 rows=278811 width=6) (actual time=831562.877..831562.877 rows=0 loops=1)
  ->  Index Scan using scans_staging_scandatetime_idx on scans  (cost=0.58..3325615.74 rows=278811 width=6) (actual time=831562.875..831562.875 rows=0 loops=1)
        Index Cond: (scandatetime > (CURRENT_TIMESTAMP - '1 day'::interval))
        Filter: ((scandatetime <> (SubPlan 2)) AND (scandatetime <> (SubPlan 4)))
        Rows Removed by Filter: 277756
        SubPlan 2
          ->  Result  (cost=3.92..3.93 rows=1 width=8) (actual time=1.675..1.675 rows=1 loops=277756)
                InitPlan 1 (returns $1)
                  ->  Limit  (cost=0.70..3.92 rows=1 width=8) (actual time=1.673..1.674 rows=1 loops=277756)
                        ->  Index Only Scan using scans_staging_imb_scandatetime_idx on scans tt  (cost=0.70..16.80 rows=5 width=8) (actual time=1.672..1.672 rows=1 loops=277756)
                              Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
                              Heap Fetches: 277761
        SubPlan 4
          ->  Result  (cost=3.92..3.93 rows=1 width=8) (actual time=0.086..0.086 rows=1 loops=164210)
                InitPlan 3 (returns $3)
                  ->  Limit  (cost=0.70..3.92 rows=1 width=8) (actual time=0.084..0.085 rows=1 loops=164210)
                        ->  Index Only Scan Backward using scans_staging_imb_scandatetime_idx on scans tt_1  (cost=0.70..16.80 rows=5 width=8) (actual time=0.083..0.083 rows=1 loops=164210)
                              Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
                              Heap Fetches: 164210
Planning Time: 11.360 ms
Execution Time: 831562.956 ms

EDIT: Result with explain analyze buffers:

Delete on scans  (cost=0.57..1274693.83 rows=103787 width=6) (actual time=19309.026..19309.027 rows=0 loops=1)
  Buffers: shared hit=743430 read=46033
  I/O Timings: read=15917.966
  ->  Index Scan using scans_staging_scandatetime_idx on scans  (cost=0.57..1274693.83 rows=103787 width=6) (actual time=19309.025..19309.025 rows=0 loops=1)
        Index Cond: (scandatetime > (CURRENT_TIMESTAMP - '1 day'::interval))
        Filter: ((scandatetime <> (SubPlan 2)) AND (scandatetime <> (SubPlan 4)))
        Rows Removed by Filter: 74564
        Buffers: shared hit=743430 read=46033
        I/O Timings: read=15917.966
        SubPlan 2
          ->  Result  (cost=4.05..4.06 rows=1 width=8) (actual time=0.232..0.233 rows=1 loops=74564)
                Buffers: shared hit=458108 read=27849
                I/O Timings: read=15114.478
                InitPlan 1 (returns $1)
                  ->  Limit  (cost=0.70..4.05 rows=1 width=8) (actual time=0.231..0.231 rows=1 loops=74564)
                        Buffers: shared hit=458108 read=27849
                        I/O Timings: read=15114.478
                        ->  Index Only Scan using scans_staging_imb_scandatetime_idx on scans tt  (cost=0.70..20.81 rows=6 width=8) (actual time=0.230..0.230 rows=1 loops=74564)
                              Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
                              Heap Fetches: 74583
                              Buffers: shared hit=458108 read=27849
                              I/O Timings: read=15114.478
        SubPlan 4
          ->  Result  (cost=4.05..4.06 rows=1 width=8) (actual time=0.042..0.042 rows=1 loops=34497)
                Buffers: shared hit=228637 read=701
                I/O Timings: read=507.724
                InitPlan 3 (returns $3)
                  ->  Limit  (cost=0.70..4.05 rows=1 width=8) (actual time=0.041..0.041 rows=1 loops=34497)
                        Buffers: shared hit=228637 read=701
                        I/O Timings: read=507.724
                        ->  Index Only Scan Backward using scans_staging_imb_scandatetime_idx on scans tt_1  (cost=0.70..20.81 rows=6 width=8) (actual time=0.040..0.040 rows=1 loops=34497)
                              Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
                              Heap Fetches: 34497
                              Buffers: shared hit=228637 read=701
                              I/O Timings: read=507.724
Planning Time: 5.350 ms
Execution Time: 19313.242 ms
like image 810
DBA108642 Avatar asked Sep 01 '25 20:09

DBA108642


1 Answers

Without the pre-aggregation (and avoiding the CTE):


DELETE FROM scans del
WHERE del.scandatetime > (current_timestamp - interval '21 days')
AND EXISTS (SELECT *
        FROM scans x
        WHERE x.imb = del.imb
        AND x.scandatetime < del.scandatetime
        )
AND EXISTS (SELECT *
        FROM scans x
        WHERE x.imb = del.imb
        AND x.scandatetime > del.scandatetime
        )
        ;

The idea is: you only delete if there is (at least) one record before, and (at least) one after it. (with the same imd) This is not true for the first and last records, only the ones inbetween.

like image 65
wildplasser Avatar answered Sep 03 '25 23:09

wildplasser