Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove duplicate rows with foreign keys dependencies?

I am trying to write a simple stored procedure in PostgreSQL 9.1 that will remove duplicate entries from a parent cpt table. The parent table cpt is referenced by a child table lab defined as:

CREATE TABLE lab (
 recid serial NOT NULL,
 cpt_recid integer,
  ........
 CONSTRAINT cs_cpt FOREIGN KEY (cpt_recid)
   REFERENCES cpt (recid) MATCH SIMPLE
   ON UPDATE NO ACTION ON DELETE RESTRICT,
 ...
);

The biggest problem I'm having is how to obtain the record which failed so that I can use it in the EXCEPTION clause to move the children rows from lab to one acceptable key, then loop back through and delete the unnecessary records from the cpt table.

Here is the (very wrong) code:

CREATE OR REPLACE FUNCTION h_RemoveDuplicateCPT()
  RETURNS void AS
$BODY$
BEGIN
LOOP
   BEGIN

   DELETE FROM cpt
   WHERE recid IN (
      SELECT recid
      FROM  (
         SELECT recid,
         row_number() over (partition BY cdesc ORDER BY recid) AS rnum
         FROM cpt) t
      WHERE t.rnum > 1)
   RETURNING recid;

   IF count = 0 THEN
      RETURN;
   END IF;  

   EXCEPTION WHEN foreign_key_violation THEN
      RAISE NOTICE 'fixing unique_violation';
      RAISE NOTICE 'recid is %' , recid;
   END;
END LOOP;
END;                    
$BODY$
LANGUAGE plpgsql VOLATILE;
like image 745
Alan Wayne Avatar asked Jun 20 '15 20:06

Alan Wayne


1 Answers

You can do this much more efficiently with a single SQL statement with data-modifying CTEs.

WITH plan AS (
   SELECT *
   FROM  (
      SELECT recid, min(recid) OVER (PARTITION BY cdesc) AS master_recid
      FROM   cpt
      ) sub
   WHERE  recid <> master_recid  -- ... <> self
   )
 , upd_lab AS (
   UPDATE lab l
   SET    cpt_recid = p.master_recid   -- link to master recid ...
   FROM   plan p
   WHERE  l.cpt_recid = p.recid
   )
DELETE FROM cpt c
USING  plan p
WHERE  c.recid = p.recid
RETURNING c.recid;

db<>fiddle here
Old sqlfiddle

Much faster and cleaner. Looping is comparatively expensive, exception handling is comparatively even more expensive.
More importantly, references in lab are redirected to the respective master row in cpt automatically, which wasn't in your original code, yet. So you can delete all dupes at once.

You can still wrap this in a PL/pgSQL or SQL function if you like.

Explanation

  1. In the 1st CTE plan, identify a master row in each partition with the same cdesc. In your case the row with the minimum recid.

  2. In the 2nd CTE upd_lab redirect all rows referencing a dupe to the master row in cpt.

  3. Finally, delete dupes, which is not going to raise exceptions because depending rows are being linked to the remaining master row virtually at the same time.

ON DELETE RESTRICT

All CTEs and the main query of a statement operate on the same snapshot of underlying tables, virtually concurrently. They don't see each others' effects on underlying tables:

  • Delete parent if it's not referenced by any other child

One might expect a FK constraint with ON DELETE RESTRICT to raise exceptions because, per documentation:

Referential actions other than the NO ACTION check cannot be deferred, even if the constraint is declared deferrable.

However, the above statement is a single command and, the manual again:

A constraint that is not deferrable will be checked immediately after every command.

Bold emphasis mine. Works for the less restrictive default ON DELETE NO ACTION too, of course.

But be wary of concurrent transactions writing to the same tables, but that's a general consideration, not specific to this task.

An exception applies for UNIQUE and PRIMARY KEY constraint, but that does not concern this case:

  • Constraint defined DEFERRABLE INITIALLY IMMEDIATE is still DEFERRED?
like image 175
Erwin Brandstetter Avatar answered Sep 21 '22 20:09

Erwin Brandstetter