
Clean up SQL data before unique constraint

I want to clean up some data in a table before putting in a unique constraint on two columns.

CREATE TABLE test (
 a integer NOT NULL,
 b integer NOT NULL,
 c integer NOT NULL,
 CONSTRAINT a_pk PRIMARY KEY (a)
);

INSERT INTO test (a,b,c) VALUES
 (1,2,3)
,(2,2,3)
,(3,4,3)
,(4,4,4)
,(5,4,5)
,(6,4,4)
,(7,4,4);

-- SELECT a FROM test WHERE ????

Output should be 2, 6, 7.

I am looking for all rows after the first that share the same (b,c) pair.

For example:

  • Rows 1 and 2 have (2,3) as (b,c). Row 1 is OK because it is the first; row 2 is not.

  • Rows 4, 6, and 7 have (4,4) as (b,c). Row 4 is OK because it is the first; rows 6 and 7 are not.

I will then:

DELETE FROM test WHERE a = those IDs;

.. and add the unique constraint.
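Spelled out, that final step could look like the following sketch (the constraint name `test_b_c_key` is just an example, not from the question):

```sql
-- Remove the duplicate rows found by the query (IDs from the example above) ...
DELETE FROM test WHERE a IN (2, 6, 7);

-- ... then enforce uniqueness on (b,c) going forward.
ALTER TABLE test
    ADD CONSTRAINT test_b_c_key UNIQUE (b, c);
```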

I was thinking about an intersect of test with itself, but I am not sure where to go from there.
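For what it's worth, the self-join idea can be sketched by joining each row to the first row of its (b,c) group; for the sample data this returns 2, 6, 7 (one possible approach, not necessarily the fastest):

```sql
-- For each (b,c) group, find the smallest a (the "first" row),
-- then select every row in the group with a larger a.
SELECT t.a
FROM   test t
JOIN  (
    SELECT b, c, min(a) AS first_a
    FROM   test
    GROUP  BY b, c
) f ON t.b = f.b
   AND t.c = f.c
   AND t.a > f.first_a
ORDER  BY t.a;
```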

asked Dec 27 '22 by sjakubowski

2 Answers

I ran a couple of tests. The EXISTS variant proves to be substantially faster - as I expected and contrary to what @Tometzky posted.

Test bed with 10,000 rows on PostgreSQL 9.1.2 with decent settings:

CREATE TEMP TABLE test (
  a serial
 ,b int NOT NULL
 ,c int NOT NULL
);

INSERT INTO test (b,c)
SELECT (random()* 100)::int AS b, (random()* 100)::int AS c
FROM   generate_series(1, 10000);

ALTER TABLE test ADD CONSTRAINT a_pk PRIMARY KEY (a);

Between the first and second round of tests, I ran:

ANALYZE test;

When I finally applied the DELETE, 3368 duplicates were deleted. Performance may vary if you have substantially more or fewer duplicates.
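To estimate beforehand how many rows such a cleanup will remove, one can count the rows beyond the first of each (b,c) group (a quick sanity check, not part of the original test):

```sql
-- Total rows minus distinct (b,c) pairs = rows a duplicate-DELETE would remove.
SELECT count(*) - count(DISTINCT (b, c)) AS duplicate_rows
FROM   test;
```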

I ran each query a couple of times with EXPLAIN ANALYZE and took the best result. Generally, the best hardly differs from the first or worst.
A bare SELECT (without the DELETE) shows similar results.

1. CTE with rank()

Total runtime: 150.411 ms
Total runtime: 149.853 ms -- after ANALYZE

WITH x AS (
    SELECT a
          ,rank() OVER (PARTITION BY b, c ORDER BY a) AS rk
    FROM   test
    )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    rk > 1;

2. CTE with row_number()

Total runtime: 148.240 ms
Total runtime: 147.711 ms -- after ANALYZE

WITH x AS (
    SELECT a
          ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
    FROM   test
    )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    rn > 1;

3. row_number() in subquery

Total runtime: 134.753 ms
Total runtime: 134.298 ms -- after ANALYZE

DELETE FROM test
USING (
    SELECT a
          ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
    FROM   test
    )  x
WHERE  x.a = test.a
AND    rn > 1;

4. EXISTS semi-join

Total runtime: 143.777 ms
Total runtime: 69.072 ms -- after ANALYZE

DELETE FROM test t
WHERE EXISTS (
    SELECT 1
    FROM   test t1
    WHERE  t1.a < t.a
    AND   (t1.b, t1.c) = (t.b, t.c)
    );

The difference in the second run comes from a switch to a Hash Semi Join instead of an additional Sort + Merge Semi Join.

Results

  • EXISTS clearly wins with up-to-date table statistics.
  • With outdated statistics, row_number() in a subquery is fastest.
  • rank() is the slowest variant.
  • A CTE is slower than the equivalent subquery.
  • ANALYZE (updated statistics) helps performance, sometimes a lot. Autovacuum (on by default) should more or less take care of this automatically - except for temporary tables or immediately after major changes to the table.

Test with 100,000 rows

I repeated the test with 100,000 rows and 63,045 duplicates. The results were similar, except that EXISTS was slower, even after ANALYZE.

  1. Total runtime: 1648.601 ms
  2. Total runtime: 1623.759 ms
  3. Total runtime: 1568.893 ms
  4. Total runtime: 1692.249 ms

Raising the statistics target to 1000 and then to the maximum of 10000 (overkill in real life) and running another ANALYZE sped up all queries by ~ 1 %, but the query planner still went with Sort + Merge Semi Join for EXISTS.

ALTER TABLE test ALTER COLUMN b SET STATISTICS 10000;
ALTER TABLE test ALTER COLUMN c SET STATISTICS 10000;
ANALYZE test;

Only after forcing the planner to avoid merge joins did it use a Hash Semi Join, taking half the time again:

SET enable_mergejoin = off;

  4. Total runtime: 850.615 ms
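To avoid changing the setting for the whole session, the override can be confined to a single transaction with SET LOCAL (a sketch; the original test does not say it was done this way):

```sql
BEGIN;
SET LOCAL enable_mergejoin = off;  -- discourage merge joins for this transaction only
EXPLAIN ANALYZE
DELETE FROM test t
WHERE  EXISTS (
    SELECT 1
    FROM   test t1
    WHERE  t1.a < t.a
    AND   (t1.b, t1.c) = (t.b, t.c)
);
ROLLBACK;  -- restores the deleted rows and reverts the setting
```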

Update

There have been improvements to the query planner since then. In a retest with PostgreSQL 9.1.7, it went straight to a Hash Semi Join.

answered Jan 08 '23 by Erwin Brandstetter

Using window functions should be much faster than this answer:

select a
from (
  select a, rank() over (partition by b, c order by a) as rank
  from test ) as _
where rank>1;
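To turn this SELECT into the actual cleanup, it can feed a DELETE directly (a sketch based on the query above):

```sql
-- Delete every row that is not the first of its (b,c) group.
DELETE FROM test
WHERE  a IN (
    SELECT a
    FROM  (
        SELECT a, rank() OVER (PARTITION BY b, c ORDER BY a) AS rnk
        FROM   test
    ) AS _
    WHERE  rnk > 1
);
```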
answered Jan 08 '23 by Tometzky