Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best self join technique when checking for duplicates

i'm trying to optimize a query that is in production which is taking a long time. The goal is to find duplicate records based on matching field values criteria and then deleting them. The current query uses a self join via inner join on t1.col1 = t2.col1 then a where clause to check the values.

select * from table t1 
inner join table t2 on t1.col1 = t2.col1
where t1.col2 = t2.col2 ...

What would be a better way to do this? Or is it all the same based on indexes? Maybe

select * from table t1, table t2
where t1.col1 = t2.col1, t2.col2 = t2.col2 ...

this table has 100m+ rows.

MS SQL, SQL Server 2008 Enterprise

select distinct t2.id
    from table1 t1 with (nolock)
    inner join table1 t2 with (nolock) on  t1.ckid=t2.ckid
    left join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
    where 
    t2.id > @Max_id and
    t2.timestamp > t1.timestamp and
    t2.rid = 2 and
    isnull(t1.col1,'') = isnull(t2.col1,'') and 
    isnull(t1.cid,-1) = isnull(t2.cid,-1) and
    isnull(t1.rid,-1) = isnull(t2.rid,-1)and 
    isnull(t1.typeid,-1) = isnull(t2.typeid,-1) and
    isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1) and
    isnull(t1.oid,'') = isnull(t2.oid,'') and
    isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)  

    and (
            (
                t3.uniqueoid = 1
            )
            or
            (
                t3.uniqueoid is null and 
                isnull(t1.col1,'') = isnull(t2.col1,'') and 
                isnull(t1.col2,'') = isnull(t2.col2,'') and
                isnull(t1.rdid,-1) = isnull(t2.rdid,-1) and 
                isnull(t1.stid,-1) = isnull(t2.stid,-1) and
                isnull(t1.huaid,-1) = isnull(t2.huaid,-1) and
                isnull(t1.lpid,-1) = isnull(t2.lpid,-1) and
                isnull(t1.col3,-1) = isnull(t2.col3,-1) 
            )
    )
like image 376
Dustin Davis Avatar asked May 02 '11 15:05

Dustin Davis


People also ask

How do you prevent duplicates in self join?

Solution. Select column values in a specific order within rows to make rows with duplicate sets of values identical. Then you can use SELECT DISTINCT to remove duplicates. Alternatively, retrieve rows in such a way that near-duplicates are not even selected.

Which type of join is used to find duplicate records?

INNER JOIN: It is a keyword used for querying two tables to get the records having matching values in both the table.

How do you find duplicates in SQL using JOINs?

Check for Duplicates in Multiple Tables With INNER JOINUse the INNER JOIN function to find duplicates that exist in multiple tables. Sample syntax for an INNER JOIN function looks like this: SELECT column_name FROM table1 INNER JOIN table2 ON table1. column_name = table2.


3 Answers

Why self join: this is an aggregate question.

Hope you have an index on col1, col2, ...

--DELETE table
--WHERE KeyCol NOT IN (
select
    MIN(KeyCol) AS RowToKeep,
    col1, col2, 
from
    table
GROUP BY
    col12, col2
HAVING
   COUNT(*) > 1
--)

However, this will take some time. Have a look at bulk delete techniques

like image 73
gbn Avatar answered Oct 21 '22 03:10

gbn


You can use ROW_NUMBER() to find duplicate rows in one table.

You can check here

like image 33
Bruno Costa Avatar answered Oct 21 '22 05:10

Bruno Costa


The two methods you give should be equivalent. I think most SQL engines would do exactly the same thing in both cases.

And, by the way, this won't work. You have to have at least one field that is differernt or every record will match itself.

You might want to try something more like:

select col1, col2, col3
from table
group by col1, col2, col3
having count(*)>1
like image 27
Jay Avatar answered Oct 21 '22 05:10

Jay