Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest technique to deleting duplicate data

After searching stackoverflow.com I found several questions asking how to remove duplicates, but none of them addressed speed.

In my case I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million other rows with duplicates in 9 of the 10 columns. My current technique is taking (so far) 3 hours to delete these 5 million rows. Here is my process:

-- Step 1:  **This step took 13 minutes.** Insert only one of the n duplicate rows into a temp table
select
    MAX(prikey) as MaxPriKey, -- identity(1, 1)
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
into #dupTemp
FROM sourceTable
group by
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
having COUNT(*) > 1

Next,

-- Step 2: **This step is taking the 3+ hours**
-- delete the row when all the non-unique columns are the same (duplicates) and
-- have a smaller prikey not equal to the max prikey
delete 
from sourceTable
from sourceTable
inner join #dupTemp on  
    sourceTable.a = #dupTemp.a and
    sourceTable.b = #dupTemp.b and
    sourceTable.c = #dupTemp.c and
    sourceTable.d = #dupTemp.d and
    sourceTable.e   = #dupTemp.e and
    sourceTable.f = #dupTemp.f and
    sourceTable.g = #dupTemp.g and
    sourceTable.h = #dupTemp.h and
    sourceTable.i   = #dupTemp.i and
    sourceTable.PriKey != #dupTemp.MaxPriKey  

Any tips on how to speed this up, or a faster way? Remember I will have to run this again for rows that are not exact duplicates.

Thanks so much.

UPDATE:
I had to stop step 2 from running at the 9 hour mark. I tried OMG Ponies' method and it finished after only 40 minutes. I tried my step 2 with Andomar's batch delete, it ran the 9 hours before I stopped it. UPDATE: Ran a similar query with one less field to get rid of a different set of duplicates and the query ran for only 4 minutes (8000 rows) using OMG Ponies' method.

I will try the cte technique the next chance I get, however, I suspect OMG Ponies' method will be tough to beat.

like image 236
O.O Avatar asked Aug 17 '10 21:08

O.O


People also ask

Which methods is used to remove duplicates?

Sets, built-in functions, and iterative methods can be used to remove duplicates from a list.

Which tool will completely remove the duplicate record?

To remove duplicate values, click Data > Data Tools > Remove Duplicates. To highlight unique or duplicate values, use the Conditional Formatting command in the Style group on the Home tab.

What is the shortcut to remove duplicates in Excel?

To use a keyboard shortcut to access the Remove Duplicates command on the Data tab on the Ribbon, press Alt > A > M (press Alt, then A, then M).


2 Answers

What about EXISTS:

DELETE FROM sourceTable
 WHERE EXISTS(SELECT NULL
                FROM #dupTemp dt
               WHERE sourceTable.a = dt.a 
                 AND sourceTable.b = dt.b 
                 AND sourceTable.c = dt.c 
                 AND sourceTable.d = dt.d 
                 AND sourceTable.e = dt.e 
                 AND sourceTable.f = dt.f 
                 AND sourceTable.g = dt.g 
                 AND sourceTable.h = dt.h 
                 AND sourceTable.i = dt.i 
                 AND sourceTable.PriKey < dt.MaxPriKey)
like image 117
OMG Ponies Avatar answered Sep 30 '22 14:09

OMG Ponies


Can you afford to have the original table unavailable for a short time?

I think the fastest solution is to create a new table without the duplicates. Basically the approach that you use with the temp table, but creating a "regular" table instead.

Then drop the original table and rename the intermediate table to have the same name as the old table.

like image 36
a_horse_with_no_name Avatar answered Sep 30 '22 12:09

a_horse_with_no_name