Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Delete duplicate rows from small table

People also ask

How do I remove duplicate rows from student table?

The basic syntax to eliminate duplicate records from a table is: SELECT DISTINCT column1, column2,.... columnN. FROM table _name.

Can you delete duplicate rows in a table without using a temporary table?

You can't. Fully duplicate records (ones which have every field equal) can't be deleted by one, because you can't differentiate them in a Where clause in a delete query. The only way would be to do a select distinct query to select all rows without duplicates, then insert them in an empty table.


A faster solution is

DELETE FROM dups a USING (
      SELECT MIN(ctid) as ctid, key
        FROM dups 
        GROUP BY key HAVING COUNT(*) > 1
      ) b
      WHERE a.key = b.key 
      AND a.ctid <> b.ctid

DELETE FROM dupes a
WHERE a.ctid <> (SELECT min(b.ctid)
                 FROM   dupes b
                 WHERE  a.key = b.key);

This is fast and concise:

DELETE FROM dupes T1
    USING   dupes T2
WHERE   T1.ctid < T2.ctid  -- delete the older versions
    AND T1.key  = T2.key;  -- add more columns if needed

See also my answer at How to delete duplicate rows without unique identifier which includes more information.


I tried this:

DELETE FROM tablename
WHERE id IN (SELECT id
              FROM (SELECT id,
                             ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
                     FROM tablename) t
              WHERE t.rnum > 1);

provided by Postgres wiki:

https://wiki.postgresql.org/wiki/Deleting_duplicates


EXISTS is simple and among the fastest for most data distributions:

DELETE FROM dupes d
WHERE  EXISTS (
   SELECT FROM dupes
   WHERE  key = d.key
   AND    ctid < d.ctid
   );

From each set of duplicate rows (defined by identical key), this keeps the one row with the minimum ctid.

Result is identical to the currently accepted answer by a_horse. Just faster, because EXISTS can stop evaluating as soon as the first offending row is found, while the alternative with min() has to consider all rows per group to compute the minimum. Speed is of no concern to this question, but why not take it?

You may want to add a UNIQUE constraint after cleaning up, to prevent duplicates from creeping back in:

ALTER TABLE dupes ADD CONSTRAINT constraint_name_here UNIQUE (key);

About the system column ctid:

  • Is the system column “ctid” legitimate for identifying rows to delete?

If there is any other column defined UNIQUE NOT NULL column in the table (like a PRIMARY KEY) then, by all means, use it instead of ctid.

If key can be NULL and you only want one of those, too, use IS NOT DISTINCT FROM instead of =. See:

  • How do I (or can I) SELECT DISTINCT on multiple columns?

As that's slower, you might instead run the above query as is, and this in addition:

DELETE FROM dupes d
WHERE  key IS NULL
AND    EXISTS (
   SELECT FROM dupes
   WHERE  key IS NULL
   AND    ctid < d.ctid
   );

And consider:

  • Create unique constraint with null columns

For small tables, indexes generally do not help performance. And we need not look further.

For big tables and few duplicates, an existing index on (key) can help (a lot).

For mostly duplicates, an index may add more cost than benefit, as it has to be kept up to date concurrently. Finding duplicates without index becomes faster anyway because there are so many and EXISTS only needs to find one. But consider a completely different approach if you can afford it (i.e. concurrent access allows it): Write the few surviving rows to a new table. That also removes table (and index) bloat in the process. See:

  • How to delete duplicate entries?