SQL Deduplicate List of Tuples

Tags:

I have a table with two columns of IDs, like so:

╔════════╦══════╗
║ Master ║ Dupe ║
╠════════╬══════╣
║ 2      ║ 7    ║
║ 3      ║ 6    ║
║ 6      ║ 7    ║
║ 20     ║ 25   ║
║ 75     ║ 25   ║
╚════════╩══════╝

Each row represents the IDs of two rows in a sql table that are deemed to be duplicates of each other.

This table can contain many thousands of entries, with no guarantees for the data other than the Master column being sorted in ascending order as pictured. Either column could contain the same ID as the other column, potentially against different or the same partner ID. Again - no guarantees.

From this table I would like get an index of Master and all of its possible dupes. Like pictured below.

Desired outcomes:

Lowest ID should be kept as master
All subsequent dupes of a dupe should map back to the same (lowest ID) master

For the above, the desired output would look like so (but the columns don't HAVE to be sorted):

Click to copy

╔════════╦══════╗
║ Master ║ Dupe ║
╠════════╬══════╣
║ 2      ║ 3    ║
║ 2      ║ 6    ║
║ 2      ║ 7    ║
║ 20     ║ 25   ║
║ 20     ║ 75   ║
╚════════╩══════╝

I'm finding it difficult to explain this problem so my googling has not returned much. I'm thinking there must be an algorithm somewhere for iterating through a list of tuples like this and discovering duplication.

Any help appreciated!

EDIT: I've modified the example tables to better explain what their contents might look like.

Some notes to consider,

There is no guarantee of a chain. It could all be one big chain, lots of small ones or none at all.
There is no guarantee that all pairs appear in reverse order somewhere else in the table

From what I can see, the problem looks to be recursive, I think LukStorms is on the right track but I can't quite figure it out

ANSWERED: While both solutions below from @artm and @LukStorms seem to work, I found the latter to be a little more succinct and readable. Thank you both! Fantastic help on a tough question. I only wish I could award the answer to you both

481

asked Jun 13 '17 06:06

Sean Missingham

2 Answers

Try this. Get the min of master from your table with a CTE and cross join to all other values in the table.

Click to copy

;WITH minmaster as (select MIN(MASTER) master
FROM myTable)
select distinct m.master
, i.dupe
from minmaster m 
cross join (select dupe dupe from myTable union all select master from myTable) i
WHERE i.dupe <> m.master

Update:

After your edit with more rows this below works although I'm not sure if it's the best solution. Logic was start with the first master dupe (since data is sorted by master), if the dupe exists on the 2nd column where first column is not equal to current master, then take the same master, otherwise take the next master. It's hard to explain, someone else can probably find an easier solution.

Click to copy

;WITH myTable AS 
(SELECT 2 MASTER, 7 dupe
UNION all SELECT 3, 6
UNION all SELECT 6, 7
UNION all SELECT 20, 25
UNION all SELECT 75, 25
UNION all SELECT 100, 125
UNION all SELECT 150, 300
UNION all SELECT 180, 300
)
, cte AS 
(
SELECT m.master L, m.dupe R, ROW_NUMBER() OVER (ORDER BY master) rnkC
FROM myTable m
)
, cte2 AS 
(
SELECT m.master L, m.dupe R, ROW_NUMBER() OVER (ORDER BY master) rnkC2
FROM myTable m
)
, cteCur AS 
(
SELECT TOP 1 cte.l, cte.R, cte.rnkC
FROM cte
UNION ALL
SELECT 
CASE WHEN cteCur.r IN (SELECT dupe 
                        FROM myTable 
                        WHERE MASTER <> cteCur.L AND dupe = cteCur.R) 
    THEN cteCur.L 
    ELSE (SELECT cte2.L 
            FROM cte2 
            WHERE cte2.rnkC2 = cteCur.rnkC + 1) 
    END
, CASE WHEN cteCur.r IN (SELECT dupe 
                            FROM myTable 
                            WHERE MASTER <> cteCur.L AND dupe = cteCur.R) 
        THEN (SELECT cte2.L 
                FROM cte2 
                WHERE cte2.R = cteCur.R AND cte2.L <> cteCur.L) 
        ELSE (SELECT cte2.R 
                FROM cte2 
                WHERE cte2.rnkC2 = cteCur.rnkC + 1) 
        END
, cteCur.rnkC + 1
FROM cteCur
WHERE cteCur.L IS NOT NULL
)
SELECT cteCur.L Master
, cteCur.R Dupe
FROM cteCur
WHERE L IS NOT NULL
ORDER BY L, R

108

answered Oct 10 '22 05:10

artm

Here's an example that uses a recursive CTE to connect those duplicates.

But to make sure that duplicates are all in both directions, the DUPES CTE is used.

Click to copy

declare @DuplicateTest table (Master int, Dupe int);

insert into @DuplicateTest (Master, Dupe) values 
(3,6),(6,7),(2,7),
(20,25),(75,25);

;with DUPES as
(
     select distinct Master as Dupe1, Dupe as Dupe2 from @DuplicateTest
     union
     select distinct Dupe, Master from @DuplicateTest
)
,RCTE as
(
   select Dupe1 as Base, 0 as Level, Dupe1, Dupe2
   from DUPES

   union all

   select r.Base, (r.Level + 1), d.Dupe1, d.Dupe2
   from RCTE r
   join DUPES d on (r.Dupe2 = d.Dupe1 
                    and r.Dupe1 != d.Dupe2 -- don't loop on the reverse
                    and r.Base != d.Dupe2 -- don't repeat what we started from
                    and r.Level < 100) -- if the level gets to big it's most likely a loop
)
select min(Dupe2) as Master, Base as Dupe
from RCTE
group by Base
having Base > min(Dupe2)
order by Base;

answered Oct 10 '22 03:10

LukStorms

Related questions
                            
                                HiveQL - String contains equivalent in hiveql UDF?
                            
                                Union with unequal number of columns in Laravel
                            
                                How to split a string that appears twice?
                            
                                SQL select merge two columns into one
                            
                                Sequelize upsert() never updates and only inserts
                            
                                Consider a single record, per id, in a group by
                            
                                Filtered unique constraint or similar tool
                            
                                Subquery with count in SQLAlchemy
                            
                                SQL Error 17268 : Year out of range (Java/Spring)
                            
                                Is adding and dropping indexes everyday on huge tables a good practice?
                            
                                mysql UPDATE says column cannot be null. Why is it null?
                            
                                SQL silently converts int to varchar, but then throws an error when it comes across a varchar?
                            
                                Group records by consecutive dates when dates are not exactly consecutive
                            
                                Update with Dapper using dynamic column name
                            
                                Spring @Query with Lower and Wildcards
                            
                                Find out if a business is currently open in T-SQL
                            
                                Pyspark: cast array with nested struct to string
                            
                                How to get the 2 greatest values between multiple columns?
                            
                                How to check if at least one of a group of rows has a specific value
                            
                                How can I continue a transaction in Spring Boot with PostgreSQL after an Exception occured?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

SQL Deduplicate List of Tuples

Tags:

sql

sql-server

duplicates

recursion

Sean Missingham

People also ask

2 Answers

artm

LukStorms

Recent Activity

Donate For Us