How to select and/or delete all but one row of each set of duplicates in a table?

Tags:

duplicates

Let's say I have a MySQL table with four columns:

ID	DRIVER_ID	CAR_ID	NOTES (NULL for most rows)

I have a bunch of duplicate rows where DRIVER_ID and CAR_ID are the same. For each pair of DRIVER_ID and CAR_ID, I want one row. If one of the rows in the set has non-NULL NOTES, I want that one, but otherwise it doesn't matter.

So, If I have:

ID	DRIVER_ID	CAR_ID	NOTES
1	1	1	NULL
2	1	1	NULL
3	1	2	NULL
4	1	2	NULL
5	2	3	NULL
6	2	3	NULL
7	2	3	NULL
8	2	3	hi
9	3	5	NULL

I want to keep the following IDs: 9, 8, and then one each of [3,4] and [1,2].

It's a huge table, and the clunky methods I've tried are insanely slow, to the point where I'm sure I'm going about it all wrong. How can I efficiently:

Select the list of IDs to delete?
Delete them in the same query?

Some of the stuff I've tried so far:

select ID, COUNT(DRIVER_ID) rowcount from CARS_DRIVERS group by CAR_ID,DRIVER_ID HAVING rowcount > 1;

This will get me one ID per group, but it doesn't necessarily leave the row with NOTES if there is one, though. It will also only get me one ID per duplicate group. There are some cases where there are 20+ duplicate combinations so I would need to iterate that over and over to whittle each group down to a single row.

select distinct t1.ID from CARS_DRIVERS t1 where exists (select * from CARS_DRIVERS t2 where t2.CAR_ID = t1.CAR_ID and t2.DRIVER_ID = t1.DRIVER_ID and t2.id > t1.id);

This is much slower, and still doesn't really address the NOTES issue. It does have the advantage of getting the oldest row for each group, which, if I can't isolate on the NOTES field easily, could be a proxy for that. If a row in a set has NOTES, I believe it's always the oldest one (one with the lowest ID), but I'm not certain.

DRIVER_ID and CAR_ID are not the real column names, and there are other columns in the table. I was trying to distill down the info to get at the root of the problem, but I see from W4M's comment that this makes it look like a homework assignment. The real deal is that I'm looking at a very unoptimized database (not my purview normally) and when trying to get rid of these dupes before adding a key, the operation is taking forever - as in, hours.

The table is big but certainly doesn't justify that. I'm trying to pitch in with my limited SQL expertise and figure out a way to get this done. It doesn't matter if it's pretty, I can sit at the command line and brute-force a bunch of queries if necessary but I noticed that selecting IDs that are candidates for deletion only takes a few seconds, and although the table is huge, the total number of rows to delete is less than 10,000 so there must be a way to make this happen without some script that takes a whole weekend to finish.

544

asked Jun 24 '11 19:06

NChase

1 Answers

Here's one solution. I tested this on MySQL 5.5.8.

SELECT MAX(COALESCE(c2.id, c1.id)) AS id,
 c1.driver_id, c1.car_id,
 c2.notes AS notes
FROM cars_drivers AS c1
LEFT OUTER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c2.notes IS NOT NULL
GROUP BY c1.driver_id, c1.car_id, c2.notes;

I include c2.notes as a GROUP BY key because you might have more than one row with non-null notes per values of driver_id,car_id.

Result using your example data:

+------+-----------+--------+-------+
| id   | driver_id | car_id | notes |
+------+-----------+--------+-------+
|    2 |         1 |      1 | NULL  |
|    4 |         2 |      1 | NULL  |
|    8 |         3 |      2 | hi    |
|    9 |         5 |      3 | NULL  |
+------+-----------+--------+-------+

Regarding deleting. In your example data, it's always the highest id value per driver_id & car_id that you want to keep. If you can depend on that, you can do a multi-table delete that deletes all rows for which a row with a higher id value and the same driver_id & car_id exists:

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c1.id < c2.id;

This naturally skips any cases where only one row exists with a given pair of driver_id & car_id values, because the conditions of the inner join require two rows with different id values.

But if you can't depend on the latest id per group being the one you want to keep, the solution is more complex. It's probably more complex than it's worth to solve in one statement, so do it in two statements.

I tested this too, after adding a couple more rows for testing:

INSERT INTO cars_drivers VALUES (10,2,3,NULL), (11,2,3,'bye');

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  1 |      1 |         1 | NULL  |
|  2 |      1 |         1 | NULL  |
|  3 |      1 |         2 | NULL  |
|  4 |      1 |         2 | NULL  |
|  5 |      2 |         3 | NULL  |
|  6 |      2 |         3 | NULL  |
|  7 |      2 |         3 | NULL  |
|  8 |      2 |         3 | hi    |
|  9 |      3 |         5 | NULL  |
| 10 |      2 |         3 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+

First delete rows with null notes, where a row with non-null notes exists.

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id)
WHERE c1.notes IS NULL AND c2.notes IS NOT NULL;

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  1 |      1 |         1 | NULL  |
|  2 |      1 |         1 | NULL  |
|  3 |      1 |         2 | NULL  |
|  4 |      1 |         2 | NULL  |
|  8 |      2 |         3 | hi    |
|  9 |      3 |         5 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+

Second, delete all but the highest-id row from each group of duplicates.

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c1.id < c2.id;

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  2 |      1 |         1 | NULL  |
|  4 |      1 |         2 | NULL  |
|  9 |      3 |         5 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+

answered Oct 18 '22 08:10

Bill Karwin

Related questions
                            
                                how to get the size of a record in mysql
                            
                                Free tool to watch database for changes?
                            
                                Hibernate Communications Link Failure in Hibernate Based Java Servlet application powered by MySQL
                            
                                SQL, counting from 0 to 10
                            
                                MySQL Keyword Search Across Multiple Tables
                            
                                Accessing a XAMPP mysql via Python
                            
                                MySqlDataReader stops reading after about 4 minutes and returns
                            
                                Create field with default as epoch time
                            
                                Returning multiple columns from SELECT nested within CASE MySQL
                            
                                Select adjacent records in Mysql
                            
                                MySql Temp Tables VS Views VS php arrays
                            
                                How Database stores data internally in B-Tree/B+Tree
                            
                                Best way to store and retrieve big video files for a web based app? [closed]
                            
                                Tools for visualizing mysql explain [closed]
                            
                                Should 1-time use data like verification code to be stored in the same table
                            
                                Handling innoDB deadlock
                            
                                Reverse of ON DELETE CASCADE
                            
                                Is it bad using $_SESSION['id'] in queries?
                            
                                How to read innodb log file?
                            
                                Mapping Java byte[] to MySQL binary(64) in Hibernate

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With