Best self join technique when checking for duplicates

I'm trying to optimize a production query that is taking a long time. The goal is to find duplicate records, based on matching values in a set of fields, and then delete them. The current query uses a self join (an inner join on t1.col1 = t2.col1) followed by a WHERE clause that compares the remaining values.

select * from table t1 
inner join table t2 on t1.col1 = t2.col1
where t1.col2 = t2.col2 ...

What would be a better way to do this? Or is it all the same based on indexes? Maybe

select * from table t1, table t2
where t1.col1 = t2.col1 and t1.col2 = t2.col2 ...

This table has 100M+ rows.

MS SQL, SQL Server 2008 Enterprise

select distinct t2.id
    from table1 t1 with (nolock)
    inner join table1 t2 with (nolock) on  t1.ckid=t2.ckid
    left join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
    where 
    t2.id > @Max_id and
    t2.timestamp > t1.timestamp and
    t2.rid = 2 and
    isnull(t1.col1,'') = isnull(t2.col1,'') and 
    isnull(t1.cid,-1) = isnull(t2.cid,-1) and
    isnull(t1.rid,-1) = isnull(t2.rid,-1) and 
    isnull(t1.typeid,-1) = isnull(t2.typeid,-1) and
    isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1) and
    isnull(t1.oid,'') = isnull(t2.oid,'') and
    isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)  

    and (
            (
                t3.uniqueoid = 1
            )
            or
            (
                t3.uniqueoid is null and 
                isnull(t1.col1,'') = isnull(t2.col1,'') and 
                isnull(t1.col2,'') = isnull(t2.col2,'') and
                isnull(t1.rdid,-1) = isnull(t2.rdid,-1) and 
                isnull(t1.stid,-1) = isnull(t2.stid,-1) and
                isnull(t1.huaid,-1) = isnull(t2.huaid,-1) and
                isnull(t1.lpid,-1) = isnull(t2.lpid,-1) and
                isnull(t1.col3,-1) = isnull(t2.col3,-1) 
            )
    )


asked May 02 '11 15:05

Dustin Davis

3 Answers

Why a self join? This is an aggregate problem.

Hope you have an index on col1, col2, ...

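-- Lists one row per duplicate group, with the key of the row to keep.
-- If used as the DELETE subquery, select only MIN(KeyCol) and drop the HAVING;
-- otherwise rows that have no duplicates would be deleted as well.
--DELETE table
--WHERE KeyCol NOT IN (
select
    MIN(KeyCol) AS RowToKeep,
    col1, col2
from
    table
GROUP BY
    col1, col2
HAVING
   COUNT(*) > 1
--)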

However, this will take some time on a table of this size. Have a look at bulk delete techniques.
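
One commonly used bulk-delete pattern on SQL Server is to delete in small batches, so that each transaction stays short. The following is only a minimal sketch under assumptions: the table name dbo.MyTable, the unique key KeyCol, the duplicate-defining columns col1 and col2, and the batch size are all hypothetical placeholders, not taken from the original query.

```sql
-- Hypothetical names: dbo.MyTable, KeyCol, col1, col2. Batch size is arbitrary.
DECLARE @BatchSize int;
DECLARE @Rows int;
SET @BatchSize = 10000;
SET @Rows = 1;

WHILE @Rows > 0
BEGIN
    -- Delete up to @BatchSize rows that are not the lowest KeyCol of their group.
    DELETE TOP (@BatchSize) t
    FROM dbo.MyTable AS t
    WHERE EXISTS (
        SELECT 1
        FROM dbo.MyTable AS k
        WHERE k.col1 = t.col1
          AND k.col2 = t.col2
          AND k.KeyCol < t.KeyCol   -- a copy with a lower key exists, so t is a duplicate
    );

    SET @Rows = @@ROWCOUNT;  -- stop once a pass deletes nothing
END
```

Note that the plain = comparisons do not match NULL to NULL, so nullable columns would need the same isnull handling as in the question. An index leading on col1, col2 should help each pass find its rows.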

answered Oct 21 '22 03:10

gbn

You can use ROW_NUMBER() to find duplicate rows in one table.

You can check here
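
A minimal sketch of the ROW_NUMBER() approach, assuming a hypothetical table dbo.MyTable with a unique KeyCol and with col1, col2 as the columns that define a duplicate (these names are placeholders, not from the question):

```sql
-- Hypothetical names: dbo.MyTable, KeyCol, col1, col2.
WITH Ranked AS (
    SELECT
        KeyCol,
        ROW_NUMBER() OVER (
            PARTITION BY col1, col2   -- columns that define a duplicate
            ORDER BY KeyCol           -- rn = 1 is the row to keep
        ) AS rn
    FROM dbo.MyTable
)
SELECT KeyCol
FROM Ranked
WHERE rn > 1;   -- every row after the first in its group is a duplicate
```

PARTITION BY puts NULLs into the same group, so the isnull() wrappers are not needed for the grouping itself. Replacing the final SELECT with DELETE FROM Ranked WHERE rn > 1 removes the duplicates directly, since SQL Server allows deleting through a CTE that reads from a single table.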

answered Oct 21 '22 05:10

Bruno Costa

The two methods you give should be equivalent. I think most SQL engines would do exactly the same thing in both cases.

And, by the way, this won't work as written. You need at least one field that is different, or every record will match itself.

You might want to try something more like:

select col1, col2, col3
from table
group by col1, col2, col3
having count(*)>1
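
If the self join is kept, the point above about needing one differing field can be sketched as follows. This assumes a hypothetical table dbo.MyTable with a unique id column; the names are placeholders only.

```sql
-- Hypothetical names: dbo.MyTable, id (unique), col1, col2.
SELECT DISTINCT t2.id
FROM dbo.MyTable AS t1
INNER JOIN dbo.MyTable AS t2
    ON  t1.col1 = t2.col1
    AND t1.col2 = t2.col2
    AND t1.id < t2.id;   -- excludes self-matches; t2 is the later copy
```

The DISTINCT is there because a row with several earlier copies would otherwise be returned once per copy. The longer query in the question already applies the same idea with t2.timestamp > t1.timestamp.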

answered Oct 21 '22 05:10

Jay

Tags: sql, sql-server-2008
