Remove duplicated subsets from very large table

The data I'm working with is fairly complicated, so I'm just going to provide a simpler example so I can hopefully expand that out to what I'm working on.

Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.

I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.

Example:

GroupID  Subset Value
-------  ----   ----
1        a      1
1        a      2
1        a      3

1        b      1
1        b      3
1        b      5

1        c      1
1        c      3
1        c      5


2        a      1
2        a      2
2        a      3

2        b      2
2        b      4
2        b      6

2        c      1
2        c      3
2        c      6

So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c'; it doesn't matter which, since both contain Values 1, 3, 5. For GroupID 2, none of the subsets are duplicated, so none are removed.

Here's the code I used to solve this on a small scale. It works great, but applied to 10+ million records it would be very slow. (I was only told the real record count later; the sample data I was given was much smaller.)

DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES  (1,'a',1),(1,'a',2),(1,'a',3)  ,(1,'b',1),(1,'b',3),(1,'b',5)  ,(1,'c',1),(1,'c',3),(1,'c',5),
        (2,'a',1),(2,'a',2),(2,'a',3)  ,(2,'b',2),(2,'b',4),(2,'b',6)  ,(2,'c',1),(2,'c',3),(2,'c',6)

SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value]

SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
    SELECT t1.GroupID, t1.SubSet
        , NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value]) FROM @values t2 WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet ORDER BY t2.[Value] FOR XML PATH(''))
    FROM @values t1
    GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues

All I'm doing here is grouping by GroupID and SubSet and concatenating all of the values into a comma-delimited string, then grouping on GroupID and that value list and taking the MIN SubSet.

Chad Baldwin asked Feb 06 '19 22:02

2 Answers

I'd go with something like this:

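;with cte as
(
    -- One row per (GroupID, SubSet) with two order-independent fingerprints of its values.
    -- Different value sets can share both fingerprints, so a match is not proof of equality.
    select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
    from @values v
    group by v.GroupID, v.SubSet
)

delete v
from @values v
join
(
    -- For each pair of subsets with matching fingerprints, mark the later SubSet for deletion.
    select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
    from cte c1
    join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
)x on v.GroupID = x.GroupID and v.SubSet = x.SubSet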

select *
from @values
Kirill Polishchuk answered Oct 05 '22 22:10


From the CHECKSUM_AGG documentation:

The CHECKSUM_AGG result does not depend on the order of the rows in the table.

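This is because it combines the values with an order-independent operation, much as a sum does: 1 + 2 + 3 = 3 + 2 + 1 = 3 + 3 = 6. The last term also shows the downside: a different set of values can produce the same result.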
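As a rough illustration of that downside, the sketch below computes the same two fingerprints used in the first answer (CHECKSUM_AGG and AVG) for two different value sets. The table variable @demo and the subset names 'x' and 'y' are made up for this example. AVG on an INT column truncates, so both sets average to 4. CHECKSUM_AGG is commonly reported to behave like a bitwise XOR; if that holds on your server, h will match as well, but treat that as an assumption to check rather than documented behavior.

```sql
-- Hypothetical demo data: two different sets of three values.
DECLARE @demo TABLE (SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL);
INSERT INTO @demo (SubSet, [Value])
VALUES ('x',2),('x',4),('x',6),
       ('y',1),('y',6),('y',7);

-- One row per subset with both fingerprints.
-- Sums are 12 and 14; integer AVG gives 4 for both.
SELECT d.SubSet,
       CHECKSUM_AGG(d.[Value]) AS h,
       AVG(d.[Value])          AS a
FROM @demo d
GROUP BY d.SubSet
ORDER BY d.SubSet;
```

If both rows come back with the same h and a, a delete keyed only on those two columns would treat 'x' and 'y' as duplicates even though their values differ.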

HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)

The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.

-- Thanks for the sample data!
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES  (1,'a',1),(1,'a',2),(1,'a',3)  ,(1,'b',1),(1,'b',3),(1,'b',5)  ,(1,'c',1),(1,'c',3),(1,'c',5),
        (2,'a',1),(2,'a',2),(2,'a',3)  ,(2,'b',2),(2,'b',4),(2,'b',6)  ,(2,'c',1),(2,'c',3),(2,'c',6);

SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value];

with
  DistinctGroups as (
    select distinct GroupId, Subset
      from @Values ),
  GroupConcatenatedValues as (
    select GroupId, Subset, Convert( VarBinary(256), (
      select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
        from @Values as V
        where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
        order by Value
        for XML Path('') ), 2 ) as GroupedBinary
     from DistinctGroups as DG )
  -- To see the intermediate results from the CTEs, you can use one of the
  --   following two queries instead of the last select:
  --   select * from DistinctGroups;
  --   select * from GroupConcatenatedValues;
  select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
    from GroupConcatenatedValues
    order by GroupId, Subset;
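
The query above only returns the hashes. One way they could drive the actual delete is sketched below. This is an assumption about how the pieces might fit together, not part of the original answer: it numbers the subsets that share a hash within a GroupID and deletes every subset after the first. The CTE names Hashed and Ranked and the column SubsetHash are made up for this sketch, SHA2_256 is swapped in for MD4 (which is deprecated in newer SQL Server versions), and VarBinary(max) is used so that long subsets are not limited to 256 bytes.

```sql
-- Sample data from the question.
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL);
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES  (1,'a',1),(1,'a',2),(1,'a',3)  ,(1,'b',1),(1,'b',3),(1,'b',5)  ,(1,'c',1),(1,'c',3),(1,'c',5),
        (2,'a',1),(2,'a',2),(2,'a',3)  ,(2,'b',2),(2,'b',4),(2,'b',6)  ,(2,'c',1),(2,'c',3),(2,'c',6);

WITH
  DistinctGroups AS (
    SELECT DISTINCT GroupID, SubSet
      FROM @values ),
  Hashed AS (
    -- Concatenate each subset's values, in Value order, as fixed-width hex
    -- (8 hex characters per INT) and hash the resulting binary.
    SELECT DG.GroupID, DG.SubSet,
           HASHBYTES( 'SHA2_256', CONVERT( VARBINARY(MAX), (
             SELECT CONVERT( VARCHAR(8), CAST( V.[Value] AS BINARY(4) ), 2 ) AS [text()]
               FROM @values AS V
               WHERE V.GroupID = DG.GroupID AND V.SubSet = DG.SubSet
               ORDER BY V.[Value]
               FOR XML PATH('') ), 2 ) ) AS SubsetHash
      FROM DistinctGroups AS DG ),
  Ranked AS (
    -- Within each GroupID, number the subsets that share a hash; rn = 1 is the one kept.
    SELECT GroupID, SubSet,
           ROW_NUMBER() OVER ( PARTITION BY GroupID, SubsetHash ORDER BY SubSet ) AS rn
      FROM Hashed )
DELETE v
  FROM @values AS v
  JOIN Ranked AS r ON r.GroupID = v.GroupID AND r.SubSet = v.SubSet
  WHERE r.rn > 1;

-- Show what is left after the delete.
SELECT *
FROM @values
ORDER BY GroupID, SubSet, [Value];
```

With the sample data this should remove subset 'c' of GroupID 1 and leave GroupID 2 untouched. On a real table with millions of rows, it may be worth materializing the per-subset hashes into an indexed temp table first; that is a suggestion to test, not a measured result.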
HABO answered Oct 05 '22 22:10