How to Find Rows which are Duplicates by a Key but Not Duplicates in All Columns?

I am working with a table which is an extract of a set of other tables. All of the rows of the extract table should be unique according to keys D1, D2 and D3. They are not. It appears that an earlier developer attempted to solve this problem by using a SELECT DISTINCT across all columns being queried from this table. This will work, but only if every row which is a duplicate on (D1, D2, D3) is also a duplicate across the non-key columns (ignoring the IDENTITY column that was added to the extract table).

In other words, given rows as follows:

D1  D2  D3  C4  C5  C6
=== === === === === ===
A   B   C   X1  X2  X3
A   B   C   X1  X2  X3

then

SELECT DISTINCT D1, D2, D3, C4, C5, C6
FROM BAD_TABLE

will "work", as there's no difference between the rows which are duplicated on (D1,D2,D3). But if the table contained

D1  D2  D3  C4  C5  C6
=== === === === === ===
A   B   C   X1  X2  X3
A   B   C   X1  X2  X4

then SELECT DISTINCT would return two rows for the key (A,B,C). Furthermore, we would have to decide which of X3 or X4 was the "correct" value.
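To make the difference concrete, here is a minimal sketch that runs the same SELECT DISTINCT against both kinds of key. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for SQL Server (an assumption made only so the example is self-contained), and the (A,B,D) rows are sample data added for illustration.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server here (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BAD_TABLE (D1 TEXT, D2 TEXT, D3 TEXT, C4 TEXT, C5 TEXT, C6 TEXT)"
)
conn.executemany(
    "INSERT INTO BAD_TABLE VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("A", "B", "C", "X1", "X2", "X3"),  # key (A,B,C): rows differ in C6
        ("A", "B", "C", "X1", "X2", "X4"),
        ("A", "B", "D", "X1", "X2", "X3"),  # key (A,B,D): identical rows
        ("A", "B", "D", "X1", "X2", "X3"),
    ],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT D1, D2, D3, C4, C5, C6 "
    "FROM BAD_TABLE ORDER BY D1, D2, D3, C6"
).fetchall()

# Count how many rows survive DISTINCT for each (D1, D2, D3) key.
rows_per_key = {}
for d1, d2, d3, *_ in distinct_rows:
    key = (d1, d2, d3)
    rows_per_key[key] = rows_per_key.get(key, 0) + 1

for key, count in sorted(rows_per_key.items()):
    print(key, count)
```

On this sample, the key (A,B,D) collapses to a single row, while (A,B,C) still comes back twice, which is exactly the case SELECT DISTINCT cannot fix.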

I know how to find the duplicates on (D1,D2,D3), and I even know how to find the duplicates across all the columns (other than the IDENTITY column). For example, this query returns every row whose (D1,D2,D3) key occurs more than once:

;WITH DUPLICATES(D1,D2,D3) AS
(
    SELECT D1, D2, D3
    FROM SOURCE
    GROUP BY D1, D2, D3
    HAVING COUNT(*)>1
)
SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
INNER JOIN DUPLICATES D
    ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3
ORDER BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6

The question is: how do I find the subset of the above result set whose rows are duplicates on (D1,D2,D3), but not duplicates on (D1,D2,D3,C4,C5,C6)?


asked Nov 22 '10 17:11

John Saunders

2 Answers

You can do it by joining the table to itself, requiring that all of the D columns are equal and that at least one of the C columns is not equal.

CREATE TABLE #Source (
    D1 VARCHAR(2),
    D2 VARCHAR(2),
    D3 VARCHAR(2),
    C4 VARCHAR(2),
    C5 VARCHAR(2),
    C6 VARCHAR(2) );

INSERT INTO #Source VALUES ('A', 'B', 'C', 'X1', 'X2', 'X3');
INSERT INTO #Source VALUES ('A', 'B', 'C', 'X1', 'X2', 'X4');
INSERT INTO #Source VALUES ('A', 'B', 'D', 'X1', 'X2', 'X3');
INSERT INTO #Source VALUES ('A', 'B', 'D', 'X1', 'X2', 'X3');

SELECT S1.D1, S1.D2, S1.D3,
       S1.C4 AS C4_1, S2.C4 AS C4_2,
       S1.C5 AS C5_1, S2.C5 AS C5_2,
       S1.C6 AS C6_1, S2.C6 AS C6_2
FROM #Source S1
INNER JOIN #Source S2
    ON  S1.D1 = S2.D1
    AND S1.D2 = S2.D2
    AND S1.D3 = S2.D3
    AND (   S1.C4 <> S2.C4
         OR S1.C5 <> S2.C5
         OR S1.C6 <> S2.C6 );

DROP TABLE #Source;

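This gives the following results (each conflicting pair appears twice, once in each direction, because the join matches S1 to S2 and S2 to S1):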

D1   D2   D3   C4_1 C4_2 C5_1 C5_2 C6_1 C6_2
---- ---- ---- ---- ---- ---- ---- ---- ----
A    B    C    X1   X1   X2   X2   X4   X3
A    B    C    X1   X1   X2   X2   X3   X4

Also note that this is compatible with MS SQL 2000, which you later indicated is required in "How to Convert a SQL Query using Common Table Expressions to One Without (for SQL Server 2000)".
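If you only need the list of conflicting keys rather than every mismatched pair, a GROUP BY with MIN/MAX comparisons is one possible variation. This is a different technique from the self-join above, not a restatement of it. The sketch below runs it through Python's sqlite3 module against an in-memory database (an assumption made so it is self-contained; the table is named Source because SQLite has no #temp naming), using the same four sample rows.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server here (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Source (D1 TEXT, D2 TEXT, D3 TEXT, C4 TEXT, C5 TEXT, C6 TEXT)"
)
conn.executemany(
    "INSERT INTO Source VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("A", "B", "C", "X1", "X2", "X3"),
        ("A", "B", "C", "X1", "X2", "X4"),
        ("A", "B", "D", "X1", "X2", "X3"),
        ("A", "B", "D", "X1", "X2", "X3"),
    ],
)

# A key is "conflicting" when at least one non-key column takes more than
# one value within the group, i.e. its MIN and MAX differ.
CONFLICTING_KEYS_SQL = """
SELECT D1, D2, D3, COUNT(*) AS row_count
FROM Source
GROUP BY D1, D2, D3
HAVING MIN(C4) <> MAX(C4)
    OR MIN(C5) <> MAX(C5)
    OR MIN(C6) <> MAX(C6)
ORDER BY D1, D2, D3
"""

conflicting_keys = conn.execute(CONFLICTING_KEYS_SQL).fetchall()
print(conflicting_keys)
```

On this sample only the key (A,B,C) is reported, once, with a row count of 2. One caveat applies to both this variation and the self-join: comparisons with `<>` never evaluate to true against NULL, and MIN/MAX skip NULLs, so rows that differ only by a NULL in a non-key column would not be flagged without extra handling.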

answered Oct 25 '22 08:10

Samuel Neff

I haven't had a chance to try Conrad's answer yet, but came up with one of my own. It's rather a "duh" moment.

So, if you want to find all the rows in set A except for those that are in set B, you use the EXCEPT operator:

;WITH KEYDUPLICATES(D1,D2,D3) AS
( 
    SELECT D1, D2, D3 
    FROM SOURCE 
    GROUP BY D1, D2, D3 
    HAVING COUNT(*)>1 
),
KEYDUPLICATEROWS AS
( 
    SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    FROM SOURCE S 
    INNER JOIN KEYDUPLICATES D 
        ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3 
),
FULLDUPLICATES AS
(
    SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    FROM SOURCE S
    GROUP BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    HAVING COUNT(*)>1
)
SELECT KR.D1, KR.D2, KR.D3, KR.C4, KR.C5, KR.C6
FROM KEYDUPLICATEROWS AS KR
EXCEPT
SELECT FD.D1, FD.D2, FD.D3, FD.C4, FD.C5, FD.C6
FROM FULLDUPLICATES AS FD
ORDER BY D1, D2, D3, C4, C5, C6

This seems to be showing me 1500 rows which are duplicates across (D1,D2,D3), but which are only duplicates across a subset of (D1,D2,D3,C4,C5,C6). In fact, it appears they are duplicates across (D1,D2,D3,C4,C5).

How to confirm that will be the subject of another question.
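For anyone who wants to see what the EXCEPT query returns on a small sample, here is a sketch that runs the same three CTEs through Python's sqlite3 module against an in-memory database. SQLite is only a stand-in for SQL Server here (an assumption), and the sample rows, including the (A,B,E) and (A,B,F) keys, are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server here (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SOURCE (D1 TEXT, D2 TEXT, D3 TEXT, C4 TEXT, C5 TEXT, C6 TEXT)"
)
conn.executemany(
    "INSERT INTO SOURCE VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("A", "B", "C", "X1", "X2", "X3"),  # key duplicate, rows differ
        ("A", "B", "C", "X1", "X2", "X4"),
        ("A", "B", "D", "X1", "X2", "X3"),  # key duplicate, rows identical
        ("A", "B", "D", "X1", "X2", "X3"),
        ("A", "B", "E", "X1", "X2", "X3"),  # two identical rows plus one odd row
        ("A", "B", "E", "X1", "X2", "X3"),
        ("A", "B", "E", "X1", "X2", "X9"),
        ("A", "B", "F", "X1", "X2", "X3"),  # unique key, never reported
    ],
)

EXCEPT_SQL = """
WITH KEYDUPLICATES(D1, D2, D3) AS
(
    SELECT D1, D2, D3
    FROM SOURCE
    GROUP BY D1, D2, D3
    HAVING COUNT(*) > 1
),
KEYDUPLICATEROWS AS
(
    SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
    FROM SOURCE S
    INNER JOIN KEYDUPLICATES D
        ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3
),
FULLDUPLICATES AS
(
    SELECT D1, D2, D3, C4, C5, C6
    FROM SOURCE
    GROUP BY D1, D2, D3, C4, C5, C6
    HAVING COUNT(*) > 1
)
SELECT D1, D2, D3, C4, C5, C6 FROM KEYDUPLICATEROWS
EXCEPT
SELECT D1, D2, D3, C4, C5, C6 FROM FULLDUPLICATES
ORDER BY D1, D2, D3, C4, C5, C6
"""

partial_duplicates = conn.execute(EXCEPT_SQL).fetchall()
for row in partial_duplicates:
    print(row)
```

On this sample the query returns both (A,B,C) rows and only the X9 row for (A,B,E). In other words, when a key has some fully identical rows and one odd row, EXCEPT strips the identical rows, so the odd row shows up without the rows it conflicts with. That may or may not matter, depending on how the result is going to be used.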

answered Oct 25 '22 08:10




Tags: sql, tsql, sql-server-2008
