I have a table with two columns of IDs, like so:
╔════════╦══════╗
║ Master ║ Dupe ║
╠════════╬══════╣
║ 2 ║ 7 ║
║ 3 ║ 6 ║
║ 6 ║ 7 ║
║ 20 ║ 25 ║
║ 75 ║ 25 ║
╚════════╩══════╝
Each row represents the IDs of two rows in a sql table that are deemed to be duplicates of each other.
This table can contain many thousands of entries, with no guarantees for the data other than the Master
column being sorted in ascending order as pictured. Either column could contain the same ID as the other column, potentially against different or the same partner ID. Again - no guarantees.
From this table I would like get an index of Master and all of its possible dupes. Like pictured below.
Desired outcomes:
For the above, the desired output would look like so (but the columns don't HAVE to be sorted):
╔════════╦══════╗
║ Master ║ Dupe ║
╠════════╬══════╣
║ 2 ║ 3 ║
║ 2 ║ 6 ║
║ 2 ║ 7 ║
║ 20 ║ 25 ║
║ 20 ║ 75 ║
╚════════╩══════╝
I'm finding it difficult to explain this problem so my googling has not returned much. I'm thinking there must be an algorithm somewhere for iterating through a list of tuples like this and discovering duplication.
Any help appreciated!
EDIT: I've modified the example tables to better explain what their contents might look like.
Some notes to consider,
From what I can see, the problem looks to be recursive, I think LukStorms is on the right track but I can't quite figure it out
ANSWERED: While both solutions below from @artm and @LukStorms seem to work, I found the latter to be a little more succinct and readable. Thank you both! Fantastic help on a tough question. I only wish I could award the answer to you both
To remove duplicate tuples from a list of tuples: Use the set() class to convert the list to a set of tuples. Any duplicate tuples will automatically get removed after the conversion. Use the list() class to convert the set back to a list.
SQL Delete Duplicate Rows using Group By and Having Clause According to Delete Duplicate Rows in SQL, for finding duplicate rows, you need to use the SQL GROUP BY clause. The COUNT function can be used to verify the occurrence of a row using the Group by clause, which groups data according to the given columns.
Duplicate tuples are not allowed in a relation because they create redundancy of data inside a data base which slows down the data processing like querying, inserting, deleting, updating etc. of the database.
Try this. Get the min of master from your table with a CTE and cross join to all other values in the table.
;WITH minmaster as (select MIN(MASTER) master
FROM myTable)
select distinct m.master
, i.dupe
from minmaster m
cross join (select dupe dupe from myTable union all select master from myTable) i
WHERE i.dupe <> m.master
Update:
After your edit with more rows this below works although I'm not sure if it's the best solution. Logic was start with the first master dupe (since data is sorted by master), if the dupe exists on the 2nd column where first column is not equal to current master, then take the same master, otherwise take the next master. It's hard to explain, someone else can probably find an easier solution.
;WITH myTable AS
(SELECT 2 MASTER, 7 dupe
UNION all SELECT 3, 6
UNION all SELECT 6, 7
UNION all SELECT 20, 25
UNION all SELECT 75, 25
UNION all SELECT 100, 125
UNION all SELECT 150, 300
UNION all SELECT 180, 300
)
, cte AS
(
SELECT m.master L, m.dupe R, ROW_NUMBER() OVER (ORDER BY master) rnkC
FROM myTable m
)
, cte2 AS
(
SELECT m.master L, m.dupe R, ROW_NUMBER() OVER (ORDER BY master) rnkC2
FROM myTable m
)
, cteCur AS
(
SELECT TOP 1 cte.l, cte.R, cte.rnkC
FROM cte
UNION ALL
SELECT
CASE WHEN cteCur.r IN (SELECT dupe
FROM myTable
WHERE MASTER <> cteCur.L AND dupe = cteCur.R)
THEN cteCur.L
ELSE (SELECT cte2.L
FROM cte2
WHERE cte2.rnkC2 = cteCur.rnkC + 1)
END
, CASE WHEN cteCur.r IN (SELECT dupe
FROM myTable
WHERE MASTER <> cteCur.L AND dupe = cteCur.R)
THEN (SELECT cte2.L
FROM cte2
WHERE cte2.R = cteCur.R AND cte2.L <> cteCur.L)
ELSE (SELECT cte2.R
FROM cte2
WHERE cte2.rnkC2 = cteCur.rnkC + 1)
END
, cteCur.rnkC + 1
FROM cteCur
WHERE cteCur.L IS NOT NULL
)
SELECT cteCur.L Master
, cteCur.R Dupe
FROM cteCur
WHERE L IS NOT NULL
ORDER BY L, R
Here's an example that uses a recursive CTE to connect those duplicates.
But to make sure that duplicates are all in both directions, the DUPES CTE is used.
declare @DuplicateTest table (Master int, Dupe int);
insert into @DuplicateTest (Master, Dupe) values
(3,6),(6,7),(2,7),
(20,25),(75,25);
;with DUPES as
(
select distinct Master as Dupe1, Dupe as Dupe2 from @DuplicateTest
union
select distinct Dupe, Master from @DuplicateTest
)
,RCTE as
(
select Dupe1 as Base, 0 as Level, Dupe1, Dupe2
from DUPES
union all
select r.Base, (r.Level + 1), d.Dupe1, d.Dupe2
from RCTE r
join DUPES d on (r.Dupe2 = d.Dupe1
and r.Dupe1 != d.Dupe2 -- don't loop on the reverse
and r.Base != d.Dupe2 -- don't repeat what we started from
and r.Level < 100) -- if the level gets to big it's most likely a loop
)
select min(Dupe2) as Master, Base as Dupe
from RCTE
group by Base
having Base > min(Dupe2)
order by Base;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With