Fastest technique for deleting duplicate data

After searching stackoverflow.com I found several questions asking how to remove duplicates, but none of them addressed speed.

In my case I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million other rows with duplicates in 9 of the 10 columns. My current technique is taking (so far) 3 hours to delete these 5 million rows. Here is my process:

-- Step 1:  **This step took 13 minutes.** Insert only one of the n duplicate rows into a temp table
select
    MAX(prikey) as MaxPriKey, -- identity(1, 1)
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
into #dupTemp
FROM sourceTable
group by
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
having COUNT(*) > 1

Next,

-- Step 2: **This step is taking the 3+ hours**
-- delete the row when all the non-unique columns are the same (duplicates) and
-- have a smaller prikey not equal to the max prikey
delete
from sourceTable
from sourceTable
inner join #dupTemp on
    sourceTable.a = #dupTemp.a and
    sourceTable.b = #dupTemp.b and
    sourceTable.c = #dupTemp.c and
    sourceTable.d = #dupTemp.d and
    sourceTable.e = #dupTemp.e and
    sourceTable.f = #dupTemp.f and
    sourceTable.g = #dupTemp.g and
    sourceTable.h = #dupTemp.h and
    sourceTable.i = #dupTemp.i and
    sourceTable.PriKey != #dupTemp.MaxPriKey

Any tips on how to speed this up, or a faster way? Remember I will have to run this again for rows that are not exact duplicates.

Thanks so much.

UPDATE:
I had to stop step 2 at the 9-hour mark. I tried OMG Ponies' method and it finished after only 40 minutes. I also tried my step 2 with Andomar's batch delete; it ran for 9 hours before I stopped it.

UPDATE:
I ran a similar query with one less field to get rid of a different set of duplicates, and it ran for only 4 minutes (8000 rows) using OMG Ponies' method.

I will try the CTE technique the next chance I get. However, I suspect OMG Ponies' method will be tough to beat.
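
For reference, the CTE technique mentioned above is usually written with ROW_NUMBER(). The following is a minimal sketch, assuming the same sourceTable columns as in step 1 and that the row with the highest prikey in each group is the one to keep. The CTE name ranked and the alias rn are placeholders, and this has not been timed against the other methods:

```sql
-- Sketch only: number the rows inside each group of identical (a..i) values,
-- highest prikey first, then delete everything except row 1 of each group.
WITH ranked AS (
    SELECT
        ROW_NUMBER() OVER (
            PARTITION BY a, b, c, d, e, f, g, h, i
            ORDER BY prikey DESC
        ) AS rn
    FROM sourceTable
)
DELETE FROM ranked
WHERE rn > 1;
```

This needs no temp table, because deleting from the CTE deletes the underlying sourceTable rows. For the second pass over the near-duplicates, the same statement should work with one column fewer in the PARTITION BY list.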

236

asked Aug 17 '10 21:08

O.O

2 Answers

What about EXISTS:

DELETE FROM sourceTable
 WHERE EXISTS(SELECT NULL
                FROM #dupTemp dt
               WHERE sourceTable.a = dt.a 
                 AND sourceTable.b = dt.b 
                 AND sourceTable.c = dt.c 
                 AND sourceTable.d = dt.d 
                 AND sourceTable.e = dt.e 
                 AND sourceTable.f = dt.f 
                 AND sourceTable.g = dt.g 
                 AND sourceTable.h = dt.h 
                 AND sourceTable.i = dt.i 
                 AND sourceTable.PriKey < dt.MaxPriKey)

117

answered Sep 30 '22 14:09

OMG Ponies

Can you afford to have the original table unavailable for a short time?

I think the fastest solution is to create a new table without the duplicates. Basically the approach that you use with the temp table, but creating a "regular" table instead.

Then drop the original table and rename the intermediate table to have the same name as the old table.
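
The copy-and-swap idea above might look roughly like the sketch below. It assumes the column names from the question (prikey plus a through i), and sourceTable_dedup is a hypothetical name for the intermediate table:

```sql
-- Sketch only. sourceTable_dedup is a hypothetical name for the intermediate table.
-- Step 1: copy one row per group of identical (a..i) values, keeping the highest prikey.
-- Rows that were never duplicated form a group of one and are copied unchanged.
SELECT
    MAX(prikey) AS prikey,
    a, b, c, d, e, f, g, h, i
INTO sourceTable_dedup
FROM sourceTable
GROUP BY a, b, c, d, e, f, g, h, i;

-- Step 2: swap the tables. The original is unavailable from here until the rename completes.
DROP TABLE sourceTable;
EXEC sp_rename 'sourceTable_dedup', 'sourceTable';
```

SELECT ... INTO does not copy indexes, constraints, or triggers, and as far as I know the identity property is not carried over when the column comes from an aggregate such as MAX, so those would have to be recreated on the new table. The DROP TABLE will also fail if other tables reference sourceTable through foreign keys.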

answered Sep 30 '22 12:09

a_horse_with_no_name


Tags: sql, sql-server, sql-server-2008, etl
