How to remove rest of the rows with the same ID starting from the first duplicate?

I have the following structure for the table DataTable: every column is of the datatype int, RowID is an identity column and the primary key. LinkID is a foreign key and links to rows of another table.

RowID   LinkID   Order  Data    DataSpecifier
1       120      1      1       1
2       120      2      1       3
3       120      3      1       10
4       120      4      1       13
5       120      5      1       10
6       120      6      1       13
7       371      1      6       2
8       371      2      3       5
9       371      3      8       1
10      371      4      10      1
11      371      5      7       2
12      371      6      3       3
13      371      7      7       2
14      371      8      17      4
.................................
.................................

I'm trying to write a query which alters every LinkID batch in the following way:

- Take every row with the same LinkID (e.g. the first batch is the first 6 rows here)
- Order them by the Order column
- Look at the Data and DataSpecifier columns as one compare unit (they can be thought of as one column, called dataunit):
  - Keep as many rows as possible from Order 1 onwards, until a duplicate dataunit comes by
  - Delete every row from that first duplicate onwards for that LinkID

So for the LinkID 120:

- Sort the batch (already sorted here, but should still do it)
- Start looking from the top (so Order=1 here), and keep going as long as you don't see a duplicate.
- Stop at the first duplicate, Order = 5 (dataunit 1 10 was already seen).
- Delete everything which has LinkID=120 AND Order>=5

After a similar process for LinkID 371 (and every other LinkID in the table), the processed table will look like this:

RowID   LinkID   Order  Data    DataSpecifier
1       120      1      1       1
2       120      2      1       3
3       120      3      1       10
4       120      4      1       13
7       371      1      6       2
8       371      2      3       5
9       371      3      8       1
10      371      4      10      1
11      371      5      7       2
12      371      6      3       3
.................................
.................................
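As a reference for what any SQL solution has to reproduce, the procedure above can be sketched procedurally. This is only an illustration in Python, not a SQL answer; the function name `rows_to_keep` and the tuple layout are assumptions made for the sketch, and the rows are the sample rows shown above.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (RowID, LinkID, Order, Data, DataSpecifier)
ROWS = [
    (1, 120, 1, 1, 1), (2, 120, 2, 1, 3), (3, 120, 3, 1, 10),
    (4, 120, 4, 1, 13), (5, 120, 5, 1, 10), (6, 120, 6, 1, 13),
    (7, 371, 1, 6, 2), (8, 371, 2, 3, 5), (9, 371, 3, 8, 1),
    (10, 371, 4, 10, 1), (11, 371, 5, 7, 2), (12, 371, 6, 3, 3),
    (13, 371, 7, 7, 2), (14, 371, 8, 17, 4),
]


def rows_to_keep(rows):
    """Hypothetical helper: return the rows that survive the truncation."""
    kept = []
    # Sort by (LinkID, Order) so each batch is contiguous and ordered.
    ordered = sorted(rows, key=itemgetter(1, 2))
    for _, batch in groupby(ordered, key=itemgetter(1)):
        seen = set()
        for row in batch:
            dataunit = (row[3], row[4])  # (Data, DataSpecifier) compared as one unit
            if dataunit in seen:
                break  # first duplicate: drop it and everything after it
            seen.add(dataunit)
            kept.append(row)
    return kept


kept_ids = [row[0] for row in rows_to_keep(ROWS)]
print(kept_ids)  # [1, 2, 3, 4, 7, 8, 9, 10, 11, 12]
```

The printed RowIDs match the processed table above, so this can serve as a check for a set-based query.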

I've done quite a lot of SQL queries, but never something this complicated. I know I need to use a query which is something like this:

DELETE FROM DataTable  
WHERE RowID IN (SELECT RowID
                FROM DataTable
                WHERE -- ?
                GROUP BY LinkID
                HAVING COUNT(*) > 1 -- ?
                ORDER BY [Order]);

But I just can't seem to wrap my head around this and get the query right. I would preferably do this in pure SQL, with one executable (and reusable) query.

asked May 08 '19 13:05

ruohola

2 Answers

We can try using a CTE here to make things easier:

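WITH cte AS (
    -- cnt is 0 for the first occurrence of a dataunit within a LinkID, and >= 1 for repeats
    SELECT *,
        COUNT(*) OVER (PARTITION BY LinkID, Data, DataSpecifier ORDER BY [Order]) - 1 cnt
    FROM DataTable
),
cte2 AS (
    -- num is the running total of cnt, so it turns positive at the first repeat and stays positive
    SELECT *,
        SUM(cnt) OVER (PARTITION BY LinkID ORDER BY [Order]) num
    FROM cte
)

-- Delete through cte2, where num is defined
DELETE
FROM cte2
WHERE num > 0;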

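The logic here is to use COUNT as an analytic function to identify the duplicate records. We partition by LinkID along with Data and DataSpecifier and order by Order, so the first occurrence of each dataunit gets a count of zero and every repeat gets a non-zero count. A running SUM of those counts per LinkID is then positive for every record whose Order is greater than or equal to that of the first duplicate, and those records are targeted for deletion.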
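To see the intermediate values, the two window functions can be run against the sample data. The sketch below uses Python's built-in `sqlite3` module purely as a convenient engine, assuming the bundled SQLite is 3.25 or newer (window function support); it only selects `cnt` and `num` rather than deleting, and the name `doomed` is made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DataTable (RowID INTEGER PRIMARY KEY, LinkID INT, "
    "[Order] INT, Data INT, DataSpecifier INT)"
)
conn.executemany(
    "INSERT INTO DataTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, 120, 1, 1, 1), (2, 120, 2, 1, 3), (3, 120, 3, 1, 10),
        (4, 120, 4, 1, 13), (5, 120, 5, 1, 10), (6, 120, 6, 1, 13),
        (7, 371, 1, 6, 2), (8, 371, 2, 3, 5), (9, 371, 3, 8, 1),
        (10, 371, 4, 10, 1), (11, 371, 5, 7, 2), (12, 371, 6, 3, 3),
        (13, 371, 7, 7, 2), (14, 371, 8, 17, 4),
    ],
)

# Same two CTEs as in the answer, but ending in a SELECT so that the
# per-row cnt and num values can be inspected.
query = """
WITH cte AS (
    SELECT *,
        COUNT(*) OVER (PARTITION BY LinkID, Data, DataSpecifier
                       ORDER BY [Order]) - 1 AS cnt
    FROM DataTable
),
cte2 AS (
    SELECT *,
        SUM(cnt) OVER (PARTITION BY LinkID ORDER BY [Order]) AS num
    FROM cte
)
SELECT RowID, cnt, num FROM cte2 ORDER BY RowID
"""
flags = conn.execute(query).fetchall()
for row_id, cnt, num in flags:
    print(row_id, cnt, num)

# Rows with num > 0 are the ones the DELETE would remove.
doomed = [row_id for row_id, _, num in flags if num > 0]
print(doomed)  # [5, 6, 13, 14]
```

RowID 14 is the interesting case: its own `cnt` is 0 because dataunit 17 4 is new, but `num` is still 1 because the running sum carries the duplicate found at RowID 13.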

Here is a demo showing that the logic of the CTE is correct:

Demo

answered Oct 11 '22 02:10

Tim Biegeleisen

You can use the ROW_NUMBER() window function to identify any rows that repeat an earlier dataunit: within each LinkID, Data, DataSpecifier partition, every row after the original gets a row number greater than one. After that you can delete any row with a matching LinkID and an Order greater than or equal to the Order of any such repeated row.

(I originally used a second CTE to get the MIN Order, but I realized that it wasn't necessary as long as the join matched rows whose Order was greater than or equal to any Order where there was a second instance of the dataunit. By removing the MIN the query plan became quite simple and efficient.)

WITH DataUnitInstances AS (
  -- DataUnitInstanceId is 1 for the first occurrence of a dataunit, 2+ for repeats
  SELECT *
    , ROW_NUMBER() OVER
      (PARTITION BY LinkID, [Data], [DataSpecifier] ORDER BY [Order]) DataUnitInstanceId
  FROM DataTable
)
-- Delete every row that sits at or after any repeat within the same LinkID
DELETE dt
FROM DataTable dt
INNER JOIN DataUnitInstances dup ON dup.LinkID = dt.LinkID
  AND dup.[Order] <= dt.[Order]
  AND dup.DataUnitInstanceId > 1;

Here is the output from your sample data which matches your desired result:

+-------+--------+-------+------+---------------+
| RowID | LinkID | Order | Data | DataSpecifier |
+-------+--------+-------+------+---------------+
| 1     | 120    | 1     | 1    | 1             |
| 2     | 120    | 2     | 1    | 3             |
| 3     | 120    | 3     | 1    | 10            |
| 4     | 120    | 4     | 1    | 13            |
| 7     | 371    | 1     | 6    | 2             |
| 8     | 371    | 2     | 3    | 5             |
| 9     | 371    | 3     | 8    | 1             |
| 10    | 371    | 4     | 10   | 1             |
| 11    | 371    | 5     | 7    | 2             |
| 12    | 371    | 6     | 3    | 3             |
+-------+--------+-------+------+---------------+
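The join logic can also be checked outside SQL Server. The following is a minimal sketch using Python's built-in `sqlite3` module, assuming SQLite 3.25 or newer for `ROW_NUMBER()`. SQLite has no `DELETE ... FROM ... JOIN` form, so the sketch first selects the matching RowIDs with the same join condition and then deletes them by key; the variable names `doomed` and `remaining` are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DataTable (RowID INTEGER PRIMARY KEY, LinkID INT, "
    "[Order] INT, Data INT, DataSpecifier INT)"
)
conn.executemany(
    "INSERT INTO DataTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, 120, 1, 1, 1), (2, 120, 2, 1, 3), (3, 120, 3, 1, 10),
        (4, 120, 4, 1, 13), (5, 120, 5, 1, 10), (6, 120, 6, 1, 13),
        (7, 371, 1, 6, 2), (8, 371, 2, 3, 5), (9, 371, 3, 8, 1),
        (10, 371, 4, 10, 1), (11, 371, 5, 7, 2), (12, 371, 6, 3, 3),
        (13, 371, 7, 7, 2), (14, 371, 8, 17, 4),
    ],
)

# Same CTE and join condition as the answer. DISTINCT is needed here
# because one row can match several repeats (RowID 6 matches both 5 and 6).
select_doomed = """
WITH DataUnitInstances AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY LinkID, Data, DataSpecifier
                           ORDER BY [Order]) AS DataUnitInstanceId
    FROM DataTable
)
SELECT DISTINCT dt.RowID
FROM DataTable dt
INNER JOIN DataUnitInstances dup
    ON dup.LinkID = dt.LinkID
   AND dup.[Order] <= dt.[Order]
   AND dup.DataUnitInstanceId > 1
ORDER BY dt.RowID
"""
doomed = [r[0] for r in conn.execute(select_doomed)]

# Delete by primary key in a separate step.
conn.executemany("DELETE FROM DataTable WHERE RowID = ?", [(i,) for i in doomed])

remaining = [
    r[0] for r in conn.execute("SELECT RowID FROM DataTable ORDER BY RowID")
]
print(doomed)     # [5, 6, 13, 14]
print(remaining)  # [1, 2, 3, 4, 7, 8, 9, 10, 11, 12]
```

This also shows why the MIN is unnecessary: a row at or after the first repeat is at or after at least one repeat, so joining against all repeats selects exactly the same rows.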

answered Oct 11 '22 02:10

Daniel Gimenez
