Remove duplicated subsets from very large table

Tags: sql, sql-server, tsql

The data I'm working with is fairly complicated, so I'll provide a simpler example that I can hopefully expand to what I'm actually working on.

Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.

I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.

Example:

GroupID  Subset Value
-------  ------ -----
1        a      1
1        a      2
1        a      3

1        b      1
1        b      3
1        b      5

1        c      1
1        c      3
1        c      5


2        a      1
2        a      2
2        a      3

2        b      2
2        b      4
2        b      6

2        c      1
2        c      3
2        c      6

So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c'; it doesn't matter which, since both contain Values 1,3,5. For GroupID 2, none of the sets are duplicated, so none are removed.

Here's the code I used to solve this on a small scale. It works great, but applied to 10+ million records it would be very slow. (I was only told the real number of records later; the sample data I was given was much smaller.)

DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES  (1,'a',1),(1,'a',2),(1,'a',3)  ,(1,'b',1),(1,'b',3),(1,'b',5)  ,(1,'c',1),(1,'c',3),(1,'c',5),
        (2,'a',1),(2,'a',2),(2,'a',3)  ,(2,'b',2),(2,'b',4),(2,'b',6)  ,(2,'c',1),(2,'c',3),(2,'c',6)

SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value]

SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
    SELECT t1.GroupID, t1.SubSet
        , NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value]) FROM @values t2 WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet ORDER BY t2.[Value] FOR XML PATH(''))
    FROM @values t1
    GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues

All I'm doing here is grouping by GroupID and SubSet and concatenating all of the values into a comma-delimited string, then grouping on GroupID and that value list and taking the MIN SubSet.
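For comparison, the same grouping idea can be sketched with STRING_AGG, which builds the value list in a single grouped pass instead of one correlated FOR XML PATH subquery per subset. This is only a sketch under the assumption that SQL Server 2017 or later is available; the CTE name Signatures and the alias KeptSubSet are made up for illustration, and whether it is actually faster on the real tables would need to be measured.

```sql
-- Sketch only: assumes SQL Server 2017+ (STRING_AGG ... WITHIN GROUP).
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL);
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES  (1,'a',1),(1,'a',2),(1,'a',3)  ,(1,'b',1),(1,'b',3),(1,'b',5)  ,(1,'c',1),(1,'c',3),(1,'c',5),
        (2,'a',1),(2,'a',2),(2,'a',3)  ,(2,'b',2),(2,'b',4),(2,'b',6)  ,(2,'c',1),(2,'c',3),(2,'c',6);

WITH Signatures AS (
    -- One row per (GroupID, SubSet) holding its ordered, comma-delimited value list.
    -- Converting to VARCHAR(MAX) keeps long lists from hitting the 8000-byte limit.
    SELECT v.GroupID, v.SubSet,
           STRING_AGG(CONVERT(VARCHAR(MAX), v.[Value]), ',')
               WITHIN GROUP (ORDER BY v.[Value]) AS NameValues
    FROM @values v
    GROUP BY v.GroupID, v.SubSet
)
-- Subsets with the same value list inside a group collapse to one row;
-- MIN picks which of the duplicates to keep.
SELECT s.GroupID, s.NameValues, MIN(s.SubSet) AS KeptSubSet
FROM Signatures s
GROUP BY s.GroupID, s.NameValues;
```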


asked Feb 06 '19 22:02

Chad Baldwin

2 Answers

I'd go with something like this:

;with cte as
(
    select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
    from @values v
    group by v.GroupID, v.SubSet
)

delete v
from @values v
join
(
    select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
    from cte c1
    join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
)x on v.GroupID = x.GroupID and v.SubSet = x.SubSet

select *
from @values
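
Because two different subsets can share both the same CHECKSUM_AGG and the same AVG, it may be worth confirming a match exactly before deleting anything. The following is a minimal sketch of one way to do that by counting shared values; it assumes a value appears at most once within a (GroupID, SubSet), and the names Counts, Matches, KeepSubSet and DupSubSet are made up for illustration. On very large tables the self-join could be restricted to the pairs the checksum already flagged.

```sql
-- Sketch only: assumes no repeated Value inside a single (GroupID, SubSet).
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL);
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES  (1,'a',1),(1,'a',2),(1,'a',3)  ,(1,'b',1),(1,'b',3),(1,'b',5)  ,(1,'c',1),(1,'c',3),(1,'c',5),
        (2,'a',1),(2,'a',2),(2,'a',3)  ,(2,'b',2),(2,'b',4),(2,'b',6)  ,(2,'c',1),(2,'c',3),(2,'c',6);

WITH Counts AS (
    -- Size of every subset.
    SELECT GroupID, SubSet, COUNT(*) AS Cnt
    FROM @values
    GROUP BY GroupID, SubSet
),
Matches AS (
    -- Number of values shared by each pair of subsets in the same group.
    SELECT v1.GroupID, v1.SubSet AS KeepSubSet, v2.SubSet AS DupSubSet, COUNT(*) AS Shared
    FROM @values v1
    JOIN @values v2
      ON v2.GroupID = v1.GroupID
     AND v2.[Value] = v1.[Value]
     AND v2.SubSet  > v1.SubSet
    GROUP BY v1.GroupID, v1.SubSet, v2.SubSet
)
-- Two subsets are identical when the shared count equals both subset sizes.
SELECT m.GroupID, m.KeepSubSet, m.DupSubSet
FROM Matches m
JOIN Counts c1 ON c1.GroupID = m.GroupID AND c1.SubSet = m.KeepSubSet
JOIN Counts c2 ON c2.GroupID = m.GroupID AND c2.SubSet = m.DupSubSet
WHERE m.Shared = c1.Cnt AND m.Shared = c2.Cnt;
```

For the sample data this should return a single row: GroupID 1, KeepSubSet 'b', DupSubSet 'c'.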

answered Oct 05 '22 22:10

Kirill Polishchuk

From Checksum_Agg:

The CHECKSUM_AGG result does not depend on the order of the rows in the table.

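This is because it combines the values with an order-independent operation, in the same way that a sum does: 1 + 2 + 3 = 3 + 2 + 1 = 3 + 3 = 6. As a result, different sets of values can produce the same result.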

HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)

The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.

-- Thanks for the sample data!
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES  (1,'a',1),(1,'a',2),(1,'a',3)  ,(1,'b',1),(1,'b',3),(1,'b',5)  ,(1,'c',1),(1,'c',3),(1,'c',5),
        (2,'a',1),(2,'a',2),(2,'a',3)  ,(2,'b',2),(2,'b',4),(2,'b',6)  ,(2,'c',1),(2,'c',3),(2,'c',6);

SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value];

with
  DistinctGroups as (
    select distinct GroupId, Subset
      from @Values ),
  GroupConcatenatedValues as (
    select GroupId, Subset, Convert( VarBinary(256), (
      select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
        from @Values as V
        where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
        order by Value
        for XML Path('') ), 2 ) as GroupedBinary
     from DistinctGroups as DG )
  -- To see the intermediate results from the CTE you can use one of the
  --   following two queries instead of the last   select :
  --   select * from DistinctGroups;
  --   select * from GroupConcatenatedValues;
  select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
    from GroupConcatenatedValues
    order by GroupId, Subset;
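
The query above only reports a hash per subset. As a rough sketch of how those hashes might then be used to remove the duplicates, the version below ranks subsets that share a hash within a group and deletes all but the first. It is not part of the original answer: it swaps in 'SHA2_256' instead of 'MD4' (so the hash is 32 bytes rather than 16), the names GroupHashes, SubsetHash, Ranked and rn are made up for illustration, and it treats equal hashes as equal subsets, which is an assumption rather than a guarantee.

```sql
-- Sketch only: assumes SQL Server 2012+ (SHA2_256) and at most 2000 values
-- per subset, since VARBINARY(8000) holds 2000 four-byte values.
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL);
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES  (1,'a',1),(1,'a',2),(1,'a',3)  ,(1,'b',1),(1,'b',3),(1,'b',5)  ,(1,'c',1),(1,'c',3),(1,'c',5),
        (2,'a',1),(2,'a',2),(2,'a',3)  ,(2,'b',2),(2,'b',4),(2,'b',6)  ,(2,'c',1),(2,'c',3),(2,'c',6);

WITH
  DistinctGroups AS (
    SELECT DISTINCT GroupID, SubSet
      FROM @values ),
  GroupHashes AS (
    -- Concatenate each subset's values (as hex, in Value order) and hash the result.
    SELECT DG.GroupID, DG.SubSet,
           HASHBYTES( 'SHA2_256', CONVERT( VARBINARY(8000), (
             SELECT CONVERT( VARCHAR(8), CAST( V.[Value] AS BINARY(4) ), 2 ) AS [text()]
               FROM @values AS V
               WHERE V.GroupID = DG.GroupID AND V.SubSet = DG.SubSet
               ORDER BY V.[Value]
               FOR XML PATH('') ), 2 ) ) AS SubsetHash
      FROM DistinctGroups AS DG ),
  Ranked AS (
    -- Number the subsets that share a hash within a group; rn = 1 is the one kept.
    SELECT GroupID, SubSet,
           ROW_NUMBER() OVER ( PARTITION BY GroupID, SubsetHash ORDER BY SubSet ) AS rn
      FROM GroupHashes )
DELETE V
  FROM @values AS V
  JOIN Ranked AS R ON R.GroupID = V.GroupID AND R.SubSet = V.SubSet
  WHERE R.rn > 1;

-- Remaining rows after the duplicate subsets are removed.
SELECT *
  FROM @values
  ORDER BY GroupID, SubSet, [Value];
```

With the sample data, subsets 'b' and 'c' of GroupID 1 hash identically, so 'c' is the one removed.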

answered Oct 05 '22 22:10

HABO
