Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to select and/or delete all but one row of each set of duplicates in a table?

Let's say I have a MySQL table with four columns:

ID DRIVER_ID CAR_ID NOTES (NULL for most rows)

I have a bunch of duplicate rows where DRIVER_ID and CAR_ID are the same. For each pair of DRIVER_ID and CAR_ID, I want one row. If one of the rows in the set has non-NULL NOTES, I want that one, but otherwise it doesn't matter.

So, If I have:

ID DRIVER_ID CAR_ID NOTES
1 1 1 NULL
2 1 1 NULL
3 1 2 NULL
4 1 2 NULL
5 2 3 NULL
6 2 3 NULL
7 2 3 NULL
8 2 3 hi
9 3 5 NULL

I want to keep the following IDs: 9, 8, and then one each of [3,4] and [1,2].

It's a huge table, and the clunky methods I've tried are insanely slow, to the point where I'm sure I'm going about it all wrong. How can I efficiently:

  1. Select the list of IDs to delete?
  2. Delete them in the same query?

Some of the stuff I've tried so far:

select ID, COUNT(DRIVER_ID) rowcount from CARS_DRIVERS group by CAR_ID,DRIVER_ID HAVING rowcount > 1;

This will get me one ID per group, but it doesn't necessarily leave the row with NOTES if there is one, though. It will also only get me one ID per duplicate group. There are some cases where there are 20+ duplicate combinations so I would need to iterate that over and over to whittle each group down to a single row.

select distinct t1.ID from CARS_DRIVERS t1 where exists (select * from CARS_DRIVERS t2 where t2.CAR_ID = t1.CAR_ID and t2.DRIVER_ID = t1.DRIVER_ID and t2.id > t1.id);

This is much slower, and still doesn't really address the NOTES issue. It does have the advantage of getting the oldest row for each group, which, if I can't isolate on the NOTES field easily, could be a proxy for that. If a row in a set has NOTES, I believe it's always the oldest one (one with the lowest ID), but I'm not certain.

DRIVER_ID and CAR_ID are not the real column names, and there are other columns in the table. I was trying to distill down the info to get at the root of the problem, but I see from W4M's comment that this makes it look like a homework assignment. The real deal is that I'm looking at a very unoptimized database (not my purview normally) and when trying to get rid of these dupes before adding a key, the operation is taking forever - as in, hours.

The table is big but certainly doesn't justify that. I'm trying to pitch in with my limited SQL expertise and figure out a way to get this done. It doesn't matter if it's pretty, I can sit at the command line and brute-force a bunch of queries if necessary but I noticed that selecting IDs that are candidates for deletion only takes a few seconds, and although the table is huge, the total number of rows to delete is less than 10,000 so there must be a way to make this happen without some script that takes a whole weekend to finish.

like image 544
NChase Avatar asked Jun 24 '11 19:06

NChase


People also ask

How do you delete rows but keep one row?

Delete the duplicate rows but keep latest : using GROUP BY and MAX. One way to delete the duplicate rows but retaining the latest ones is by using MAX() function and GROUP BY clause.


1 Answers

Here's one solution. I tested this on MySQL 5.5.8.

SELECT MAX(COALESCE(c2.id, c1.id)) AS id,
 c1.driver_id, c1.car_id,
 c2.notes AS notes
FROM cars_drivers AS c1
LEFT OUTER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c2.notes IS NOT NULL
GROUP BY c1.driver_id, c1.car_id, c2.notes;

I include c2.notes as a GROUP BY key because you might have more than one row with non-null notes per values of driver_id,car_id.

Result using your example data:

+------+-----------+--------+-------+
| id   | driver_id | car_id | notes |
+------+-----------+--------+-------+
|    2 |         1 |      1 | NULL  |
|    4 |         2 |      1 | NULL  |
|    8 |         3 |      2 | hi    |
|    9 |         5 |      3 | NULL  |
+------+-----------+--------+-------+

Regarding deleting. In your example data, it's always the highest id value per driver_id & car_id that you want to keep. If you can depend on that, you can do a multi-table delete that deletes all rows for which a row with a higher id value and the same driver_id & car_id exists:

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c1.id < c2.id;

This naturally skips any cases where only one row exists with a given pair of driver_id & car_id values, because the conditions of the inner join require two rows with different id values.

But if you can't depend on the latest id per group being the one you want to keep, the solution is more complex. It's probably more complex than it's worth to solve in one statement, so do it in two statements.

I tested this too, after adding a couple more rows for testing:

INSERT INTO cars_drivers VALUES (10,2,3,NULL), (11,2,3,'bye');

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  1 |      1 |         1 | NULL  |
|  2 |      1 |         1 | NULL  |
|  3 |      1 |         2 | NULL  |
|  4 |      1 |         2 | NULL  |
|  5 |      2 |         3 | NULL  |
|  6 |      2 |         3 | NULL  |
|  7 |      2 |         3 | NULL  |
|  8 |      2 |         3 | hi    |
|  9 |      3 |         5 | NULL  |
| 10 |      2 |         3 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+

First delete rows with null notes, where a row with non-null notes exists.

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id)
WHERE c1.notes IS NULL AND c2.notes IS NOT NULL;

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  1 |      1 |         1 | NULL  |
|  2 |      1 |         1 | NULL  |
|  3 |      1 |         2 | NULL  |
|  4 |      1 |         2 | NULL  |
|  8 |      2 |         3 | hi    |
|  9 |      3 |         5 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+

Second, delete all but the highest-id row from each group of duplicates.

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c1.id < c2.id;

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  2 |      1 |         1 | NULL  |
|  4 |      1 |         2 | NULL  |
|  9 |      3 |         5 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+
like image 65
Bill Karwin Avatar answered Oct 18 '22 08:10

Bill Karwin