Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove rest of the rows with the same ID starting from the first duplicate?

Tags:

I have the following structure for the table DataTable: every column is of the datatype int, RowID is an identity column and the primary key. LinkID is a foreign key and links to rows of an other table.

RowID   LinkID   Order  Data    DataSpecifier
1       120      1      1       1
2       120      2      1       3
3       120      3      1       10
4       120      4      1       13
5       120      5      1       10
6       120      6      1       13
7       371      1      6       2
8       371      2      3       5
9       371      3      8       1
10      371      4      10      1
11      371      5      7       2
12      371      6      3       3
13      371      7      7       2
14      371      8      17      4
.................................
.................................

I'm trying to do a query which alters every LinkID batch in the following way:

  • Take every row with same LinkID (e.g. the first batch is the first 6 rows here)
  • Order them by the Order column
  • Look at Data and DataSpecifier columns as one compare unit (They can be thought as one column, called dataunit):
    • Keep as many rows from Order 1 onwards, until a duplicate dataunit comes by
    • Delete every row from that first duplicate onwards for that LinkID

So for the LinkID 120:

  • Sort the batch (already sorted here, but should still do it)
  • Start looking from the top (So Order=1 here), go as long as you don't see a duplicate.
  • Stop at the first duplicate Order = 5 (dataunit 1 10 was already seen).
  • Delete everything which has the LinkID=120 AND Order>=5

After similar process for LinkID 371 (and every other LinkID in the table), the processed table will look like this:

RowID   LinkID   Order  Data    DataSpecifier
1       120      1      1       1
2       120      2      1       3
3       120      3      1       10
4       120      4      1       13
7       371      1      6       2
8       371      2      3       5
9       371      3      8       1
10      371      4      10      1
11      371      5      7       2
12      371      6      3       3
.................................
.................................

I've done quite a lot of SQL queries, but never something this complicated. I know I need to use a query which is something like this:

DELETE FROM DataTable  
WHERE RowID IN (SELECT RowID
                FROM DataTable
                WHERE -- ?
                GROUP BY LinkID
                HAVING COUNT(*) > 1 -- ?
                ORDER BY [Order]);

But I just can't seem to wrap my head around this and get the query right. I would preferably do this in pure SQL, with one executable (and reusable) query.

like image 614
ruohola Avatar asked May 08 '19 13:05

ruohola


People also ask

How do you delete the rows which are duplicate don't delete both duplicate records?

SQL Delete Duplicate Rows using Group By and Having Clause According to Delete Duplicate Rows in SQL, for finding duplicate rows, you need to use the SQL GROUP BY clause. The COUNT function can be used to verify the occurrence of a row using the Group by clause, which groups data according to the given columns.

Does removing duplicates in Excel remove the entire row?

This method will introduce the Remove Duplicates feature to remove entire rows based on duplicates in one column easily in Excel. 1. Select the range you will delete rows based on duplicates in one column, and then click Data > Remove Duplicates.


2 Answers

We can try using a CTE here to make things easier:

WITH cte AS (
    SELECT *,
        COUNT(*) OVER (PARTITION BY LinkID, Data, DataSpecifier ORDER BY [Order]) - 1 cnt
    FROM DataTable
),
cte2 AS (
    SELECT *,
        SUM(cnt) OVER (PARTITION BY LinkID ORDER BY [Order]) num
    FROM cte
)

DELETE
FROM cte
WHERE num > 0;

enter image description here

The logic here is to use COUNT as an analytic function to identify the duplicate records. We use a partition of LinkID along with Data and DataSpecifier. Any record with an Order value greater than or equal to the first record with a non zero count is then targeted for deletion.

Here is a demo showing that the logic of the CTE is correct:

Demo

like image 122
Tim Biegeleisen Avatar answered Oct 11 '22 02:10

Tim Biegeleisen


You can use the ROW_NUMBER() window function to identify any rows that come after the original. After that you can delete and rows with a matching LinkID and a greater than or equal to any encountered Order with a row number greater than one.

(I originally used a second CTE to get the MIN order, but I realized that it wasn't necessary as long as the join to order was greater than equal to any order where there was a second instance of the DataUnitId. By removing the MIN the query plan became quite simple and efficient.)

WITH DataUnitInstances AS (
  SELECT *
    , ROW_NUMBER() OVER
      (PARTITION BY LinkID, [Data], [DataSpecifier] ORDER BY [Order]) DataUnitInstanceId
  FROM DataTable
)
DELETE FROM DataTable
FROM DataTable dt
INNER JOIN DataUnitInstances dup ON dup.LinkID = dt.LinkID 
  AND dup.[Order] <= dt.[Order]
  AND dup.DataUnitInstanceId > 1

Here is the output from your sample data which matches your desired result:

+-------+--------+-------+------+---------------+
| RowID | LinkID | Order | Data | DataSpecifier |
+-------+--------+-------+------+---------------+
| 1     | 120    | 1     | 1    | 1             |
| 2     | 120    | 2     | 1    | 3             |
| 3     | 120    | 3     | 1    | 10            |
| 4     | 120    | 4     | 1    | 13            |
| 7     | 371    | 1     | 6    | 2             |
| 8     | 371    | 2     | 3    | 5             |
| 9     | 371    | 3     | 8    | 1             |
| 10    | 371    | 4     | 10   | 1             |
| 11    | 371    | 5     | 7    | 2             |
| 12    | 371    | 6     | 3    | 3             |
+-------+--------+-------+------+---------------+
like image 41
Daniel Gimenez Avatar answered Oct 11 '22 02:10

Daniel Gimenez