 

Delete duplicates from large dataset (>100 million rows)

I know this topic has come up many times before, but none of the suggested solutions worked for my dataset: my laptop stopped calculating due to memory issues or full storage.

My table looks like the following and has 108 million rows:

Col1        | Col2 | Col3            | Col4 | SICComb  | NameComb
Case New    | 3523 | Alexander       | 6799 | 67993523 | AlexanderCase New
Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
Undisclosed | 6799 | Case New        | 3523 | 67993523 | Case NewUndisclosed
Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard
SmartCard   | 3674 | Virtual NetComm | 7373 | 73733674 | SmartCardVirtual NetComm
SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard

Duplicates are identified by the combination of SICComb and NameComb. I tried to add a primary key with:

ALTER TABLE dbo.test ADD ID INT IDENTITY(1,1)

but the integers filled up more than 30 GB of my storage within just a few minutes.

What would be the fastest and most efficient method to delete the duplicates from this table?

asked Aug 24 '13 by user2713440



2 Answers

If you're using SQL Server, you can delete directly from a common table expression (CTE):

with cte as (
    -- number the rows within each duplicate group
    select row_number() over(partition by SICComb, NameComb order by Col1) as row_num
    from Table1
)
-- keep the first row of every group, delete the rest
delete
from cte
where row_num > 1

Here all rows are numbered, and each unique combination of SICComb + NameComb gets its own sequence. You can control which row of each group survives by choosing the order by inside the over clause.
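
If you want to sanity-check before deleting, the same window function can be run as a plain select first (a sketch using the table and column names from the question and the answer above):

-- preview the rows the delete above would remove
with cte as (
    select *,
           row_number() over(partition by SICComb, NameComb order by Col1) as row_num
    from Table1
)
select *
from cte
where row_num > 1;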

answered Oct 23 '22 by Roman Pekar


In general, the fastest way to delete duplicates from a table is to insert the records -- without duplicates -- into a temporary table, truncate the original table and insert them back in.

Here is the idea, using SQL Server syntax:

-- 1. copy the unique rows into a temp table
select distinct t.*
into #temptable
from t;

-- 2. empty the original table
truncate table t;

-- 3. reload the unique rows
insert into t
    select tt.*
    from #temptable tt;

Of course, this depends to a large extent on how fast the first step is, and you need enough space to store two copies of the same table.

Note that the syntax for creating the temporary table differs among databases. Some use the syntax of create table as rather than select into.
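
For example, in MySQL or Postgres the first step above would be written as follows (a sketch; exact temporary-table behavior varies by database):

create temporary table temptable as
    select distinct *
    from t;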

EDIT:

Your identity insert error is troublesome. I think you need to remove the identity column from the column list for the select distinct. Or do:

select min(<identity col>), <all other columns>
from t
group by <all other columns>
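
Concretely, with the table and columns from the question (a sketch; dbo.test and ID come from the ALTER TABLE statement above), the first step becomes:

-- keep the smallest ID for each set of otherwise-identical rows
select min(ID) as ID, Col1, Col2, Col3, Col4, SICComb, NameComb
into #temptable
from dbo.test
group by Col1, Col2, Col3, Col4, SICComb, NameComb;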

If you have an identity column, then there are no duplicates (by definition).

In the end, you will need to decide which id you want each row to keep. If you can generate new ids for the rows, just leave the identity column out of the insert's column list:

insert into t(<all other columns>)
    select <all other columns>
    from #temptable;

If you need the old identity value (and the minimum will do), enable identity insert (set identity_insert ... on) and do:

insert into t(<all columns including identity>)
    select <all columns including identity>
    from #temptable;
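
Spelled out for the table from the question, a sketch might look like this (set identity_insert is the SQL Server switch that allows explicit values in an identity column; dbo.test and ID come from the question's ALTER TABLE):

set identity_insert dbo.test on;

insert into dbo.test (ID, Col1, Col2, Col3, Col4, SICComb, NameComb)
    select ID, Col1, Col2, Col3, Col4, SICComb, NameComb
    from #temptable;

set identity_insert dbo.test off;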
answered Oct 23 '22 by Gordon Linoff