After searching stackoverflow.com I found several questions asking how to remove duplicates, but none of them addressed speed.
In my case I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million other rows with duplicates in 9 of the 10 columns. My current technique is taking (so far) 3 hours to delete these 5 million rows. Here is my process:
-- Step 1: **This step took 13 minutes.** Insert only one of the n duplicate rows into a temp table
select
MAX(prikey) as MaxPriKey, -- identity(1, 1)
a,
b,
c,
d,
e,
f,
g,
h,
i
into #dupTemp
FROM sourceTable
group by
a,
b,
c,
d,
e,
f,
g,
h,
i
having COUNT(*) > 1
Next,
-- Step 2: **This step is taking the 3+ hours**
-- delete the row when all the non-unique columns are the same (duplicates) and
-- have a smaller prikey not equal to the max prikey
delete
from sourceTable
from sourceTable
inner join #dupTemp on
sourceTable.a = #dupTemp.a and
sourceTable.b = #dupTemp.b and
sourceTable.c = #dupTemp.c and
sourceTable.d = #dupTemp.d and
sourceTable.e = #dupTemp.e and
sourceTable.f = #dupTemp.f and
sourceTable.g = #dupTemp.g and
sourceTable.h = #dupTemp.h and
sourceTable.i = #dupTemp.i and
sourceTable.PriKey != #dupTemp.MaxPriKey
Any tips on how to speed this up, or a faster way? Remember I will have to run this again for rows that are not exact duplicates.
Thanks so much.
UPDATE:
I had to stop step 2 from running at the 9 hour mark.
I tried OMG Ponies' method and it finished after only 40 minutes.
I tried my step 2 with Andomar's batch delete, it ran the 9 hours before I stopped it.
UPDATE:
Ran a similar query with one less field to get rid of a different set of duplicates and the query ran for only 4 minutes (8000 rows) using OMG Ponies' method.
I will try the cte technique the next chance I get, however, I suspect OMG Ponies' method will be tough to beat.
Sets, built-in functions, and iterative methods can be used to remove duplicates from a list.
To remove duplicate values, click Data > Data Tools > Remove Duplicates. To highlight unique or duplicate values, use the Conditional Formatting command in the Style group on the Home tab.
To use a keyboard shortcut to access the Remove Duplicates command on the Data tab on the Ribbon, press Alt > A > M (press Alt, then A, then M).
What about EXISTS:
DELETE FROM sourceTable
WHERE EXISTS(SELECT NULL
FROM #dupTemp dt
WHERE sourceTable.a = dt.a
AND sourceTable.b = dt.b
AND sourceTable.c = dt.c
AND sourceTable.d = dt.d
AND sourceTable.e = dt.e
AND sourceTable.f = dt.f
AND sourceTable.g = dt.g
AND sourceTable.h = dt.h
AND sourceTable.i = dt.i
AND sourceTable.PriKey < dt.MaxPriKey)
Can you afford to have the original table unavailable for a short time?
I think the fastest solution is to create a new table without the duplicates. Basically the approach that you use with the temp table, but creating a "regular" table instead.
Then drop the original table and rename the intermediate table to have the same name as the old table.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With