Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Deduplicate List of Tuples

I have a table with two columns of IDs, like so:

╔════════╦══════╗
║ Master ║ Dupe ║
╠════════╬══════╣
║ 2      ║ 7    ║
║ 3      ║ 6    ║
║ 6      ║ 7    ║
║ 20     ║ 25   ║
║ 75     ║ 25   ║
╚════════╩══════╝

Each row represents the IDs of two rows in a sql table that are deemed to be duplicates of each other.

This table can contain many thousands of entries, with no guarantees for the data other than the Master column being sorted in ascending order as pictured. Either column could contain the same ID as the other column, potentially against different or the same partner ID. Again - no guarantees.

From this table I would like get an index of Master and all of its possible dupes. Like pictured below.

Desired outcomes:

  1. Lowest ID should be kept as master
  2. All subsequent dupes of a dupe should map back to the same (lowest ID) master

For the above, the desired output would look like so (but the columns don't HAVE to be sorted):

╔════════╦══════╗
║ Master ║ Dupe ║
╠════════╬══════╣
║ 2      ║ 3    ║
║ 2      ║ 6    ║
║ 2      ║ 7    ║
║ 20     ║ 25   ║
║ 20     ║ 75   ║
╚════════╩══════╝

I'm finding it difficult to explain this problem so my googling has not returned much. I'm thinking there must be an algorithm somewhere for iterating through a list of tuples like this and discovering duplication.

Any help appreciated!

EDIT: I've modified the example tables to better explain what their contents might look like.

Some notes to consider,

  • There is no guarantee of a chain. It could all be one big chain, lots of small ones or none at all.
  • There is no guarantee that all pairs appear in reverse order somewhere else in the table

From what I can see, the problem looks to be recursive, I think LukStorms is on the right track but I can't quite figure it out

ANSWERED: While both solutions below from @artm and @LukStorms seem to work, I found the latter to be a little more succinct and readable. Thank you both! Fantastic help on a tough question. I only wish I could award the answer to you both

like image 481
Sean Missingham Avatar asked Jun 13 '17 06:06

Sean Missingham


People also ask

How do I remove duplicates in a list of tuples?

To remove duplicate tuples from a list of tuples: Use the set() class to convert the list to a set of tuples. Any duplicate tuples will automatically get removed after the conversion. Use the list() class to convert the set back to a list.

How do you eliminate duplicate tuples in SQL?

SQL Delete Duplicate Rows using Group By and Having Clause According to Delete Duplicate Rows in SQL, for finding duplicate rows, you need to use the SQL GROUP BY clause. The COUNT function can be used to verify the occurrence of a row using the Group by clause, which groups data according to the given columns.

Why are duplicate tuples not allowed?

Duplicate tuples are not allowed in a relation because they create redundancy of data inside a data base which slows down the data processing like querying, inserting, deleting, updating etc. of the database.


2 Answers

Try this. Get the min of master from your table with a CTE and cross join to all other values in the table.

;WITH minmaster as (select MIN(MASTER) master
FROM myTable)
select distinct m.master
, i.dupe
from minmaster m 
cross join (select dupe dupe from myTable union all select master from myTable) i
WHERE i.dupe <> m.master

Update:

After your edit with more rows this below works although I'm not sure if it's the best solution. Logic was start with the first master dupe (since data is sorted by master), if the dupe exists on the 2nd column where first column is not equal to current master, then take the same master, otherwise take the next master. It's hard to explain, someone else can probably find an easier solution.

;WITH myTable AS 
(SELECT 2 MASTER, 7 dupe
UNION all SELECT 3, 6
UNION all SELECT 6, 7
UNION all SELECT 20, 25
UNION all SELECT 75, 25
UNION all SELECT 100, 125
UNION all SELECT 150, 300
UNION all SELECT 180, 300
)
, cte AS 
(
SELECT m.master L, m.dupe R, ROW_NUMBER() OVER (ORDER BY master) rnkC
FROM myTable m
)
, cte2 AS 
(
SELECT m.master L, m.dupe R, ROW_NUMBER() OVER (ORDER BY master) rnkC2
FROM myTable m
)
, cteCur AS 
(
SELECT TOP 1 cte.l, cte.R, cte.rnkC
FROM cte
UNION ALL
SELECT 
CASE WHEN cteCur.r IN (SELECT dupe 
                        FROM myTable 
                        WHERE MASTER <> cteCur.L AND dupe = cteCur.R) 
    THEN cteCur.L 
    ELSE (SELECT cte2.L 
            FROM cte2 
            WHERE cte2.rnkC2 = cteCur.rnkC + 1) 
    END
, CASE WHEN cteCur.r IN (SELECT dupe 
                            FROM myTable 
                            WHERE MASTER <> cteCur.L AND dupe = cteCur.R) 
        THEN (SELECT cte2.L 
                FROM cte2 
                WHERE cte2.R = cteCur.R AND cte2.L <> cteCur.L) 
        ELSE (SELECT cte2.R 
                FROM cte2 
                WHERE cte2.rnkC2 = cteCur.rnkC + 1) 
        END
, cteCur.rnkC + 1
FROM cteCur
WHERE cteCur.L IS NOT NULL
)
SELECT cteCur.L Master
, cteCur.R Dupe
FROM cteCur
WHERE L IS NOT NULL
ORDER BY L, R
like image 108
artm Avatar answered Oct 10 '22 05:10

artm


Here's an example that uses a recursive CTE to connect those duplicates.

But to make sure that duplicates are all in both directions, the DUPES CTE is used.

declare @DuplicateTest table (Master int, Dupe int);

insert into @DuplicateTest (Master, Dupe) values 
(3,6),(6,7),(2,7),
(20,25),(75,25);

;with DUPES as
(
     select distinct Master as Dupe1, Dupe as Dupe2 from @DuplicateTest
     union
     select distinct Dupe, Master from @DuplicateTest
)
,RCTE as
(
   select Dupe1 as Base, 0 as Level, Dupe1, Dupe2
   from DUPES

   union all

   select r.Base, (r.Level + 1), d.Dupe1, d.Dupe2
   from RCTE r
   join DUPES d on (r.Dupe2 = d.Dupe1 
                    and r.Dupe1 != d.Dupe2 -- don't loop on the reverse
                    and r.Base != d.Dupe2 -- don't repeat what we started from
                    and r.Level < 100) -- if the level gets to big it's most likely a loop
)
select min(Dupe2) as Master, Base as Dupe
from RCTE
group by Base
having Base > min(Dupe2)
order by Base;
like image 31
LukStorms Avatar answered Oct 10 '22 03:10

LukStorms