Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Removing duplicate SQL records to permit a unique key

Tags:

sql

join

mysql

I have a table ('sales') in a MYSQL DB which should rightfully have had a unique constraint enforced to prevent duplicates. To first remove the dupes and set the constraint is proving a bit tricky.

Table structure (simplified):

  • 'id (unique, autoinc)'
  • product_id

The goal is to enforce uniqueness for product_id. The de-duping policy I want to apply is to remove all duplicate records except the most recently created, eg: the highest id.

Or to put another way, I would like to delete only duplicate records, excluding the ids matched by the following query whilst also preserving the existing non-duped records:

select id 
  from sales s  
inner join (select product_id, 
                   max(id) as maxId 
              from sales 
          group by product_id 
            having count(product_id)  > 1) groupedByProdId on s.product_id 
                                                          and s.id = groupedByProdId.maxId

I've struggled with this on two fronts - writing the query to select the correct records to delete and then also the constraint in MYSQL where a subselect FROM clause of a DELETE cannot reference the same table from which data is being removed.

I checked out this answer and it seemed to deal with the subject, but seem specific to sql-server, though I wouldn't rule this question out from duplicating another.

like image 393
j pimmel Avatar asked May 18 '10 22:05

j pimmel


2 Answers

In reply to your comment, here's a query that works in MySQL:

delete YourTable
from YourTable
inner join YourTable yt2
on YourTable.product_id = yt2.product_id
and YourTable.id < yt2.id

This would only remove duplicate rows. The inner join will filter out the latest row for each product, even if no other rows for the same product exist.

P.S. If you try to alias the table after FROM, MySQL requires you to specify the name of the database, like:

delete <DatabaseName>.yt
from YourTable yt
inner join YourTable yt2
on yt.product_id = yt2.product_id
and yt.id < yt2.id;
like image 181
Andomar Avatar answered Sep 22 '22 21:09

Andomar


Perhaps use ALTER IGNORE TABLE ... ADD UNIQUE KEY. For example:

describe sales;
+------------+---------+------+-----+---------+----------------+
| Field      | Type    | Null | Key | Default | Extra          |
+------------+---------+------+-----+---------+----------------+
| id         | int(11) | NO   | PRI | NULL    | auto_increment | 
| product_id | int(11) | NO   |     | NULL    |                | 
+------------+---------+------+-----+---------+----------------+

select * from sales;
+----+------------+
| id | product_id |
+----+------------+
|  1 |          1 | 
|  2 |          1 | 
|  3 |          2 | 
|  4 |          3 | 
|  5 |          3 | 
|  6 |          2 | 
+----+------------+

ALTER IGNORE TABLE sales ADD UNIQUE KEY idx1(product_id), ORDER BY id DESC; 
Query OK, 6 rows affected (0.03 sec)
Records: 6  Duplicates: 3  Warnings: 0


select * from sales;
+----+------------+
| id | product_id |
+----+------------+
|  6 |          2 | 
|  5 |          3 | 
|  2 |          1 | 
+----+------------+

See this pythian post for more information.

Note that the ids end up in reverse order. I don't think this matters, since order of the ids should not matter in a database (as far as I know!). If this displeases you however, the post linked to above shows a way to solve this problem too. However, it involves creating a temporary table which requires more hard drive space than the in-place method I posted above.

like image 30
unutbu Avatar answered Sep 23 '22 21:09

unutbu