Merging duplicated records together with "Merge" syntax

Tags:

I am using SQL Server 2014. I am currently trying to combine millions of personnel application records in to a single personnel record.

The records contain the following columns:

ID, First_Name, Last_Name, DOB, Post_Code, Mobile, Email

A person can enter their details numerous times but due to fat fingers or fraud they can sometimes put in, incorrect details.

In my example Christopher has filled his details in 5 times, First_Name, Last_Name, DOB are always correct, Post_Code, Mobile and Email contain various connotations.

What I want to do is take the min(id) associated with this group in this case 84015283 and put it in to a new table, this will be the primary key and then you will see the other id's that are associated with it.

Examples

NID       CID
------------------
84015283  84015283
84015283  84069198
84015283  84070263
84015283  84369603
84015283  85061159

Where it gets a little complicated is, where 2 different people can have the same First_Name, Last_Name and DOB, at least one of the other fields must match "post_code, mobile or email" as per my example to another record within the group.

Though first_name, last_name, DoB match between ID's 84015283, 84069198, 84070263. 84015283, 84069198 are identical so they would match without an issue, 84070263 matches on the postcode, 84369603 matches on the mobile to a previous record and 85061159 matches on a previous mobile/email but not post_code.

If putting the NID within the original dataset is easier I can go with this rather than putting it all in a separate table.

After some googling and trying to get my head around this, I believe that using "Merge" might be a good way to achieve what I am after but I am concerned it will take a very long time due to the number of records involved.

Also going forward any routine would have to be run on subsequent new records.

I have listed the code for the example if anyone can help

DROP TABLE customer_dist

CREATE TABLE [dbo].customer_dist
(
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL,
)

INSERT INTO customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
       ('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
       ('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', '[email protected]'),
       ('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', '[email protected]'),
       ('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', '[email protected]'),
       ('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', '[email protected]')

SELECT * FROM customer_dist

Below is the expected results, sorry I should of made it clearer what I wanted at the end.

Output Table Results

    NID         id          First_Name  Last_Name   DoB         post_code   mobile          Email
    84015283    84015283    Christopher Higg            1/13/1956   CH2 3AZ         7089559829  [email protected]
    84015283    84069198    Christopher Higg            1/13/1956   CH2 3AZ         7089559829  [email protected]
    84015283    84070263    Christopher Higg            1/13/1956   CH2 3AZ         7089559822  [email protected]
    84015283    84369603    Christopher Higg            1/13/1956   CH2 3ZA         7089559829  [email protected]
    84015283    85061159    CHRISTOPHER Higg            1/13/1956   CH2 3RA         7089559829  [email protected]
    78065122    87065122    Matthew Davis               05/10/1978  CH5 1TS

7077084692 [email protected]

OR                          

NID         id
84015283    84015283
84015283    84069198
84015283    84070263
84015283    84369603
84015283    85061159
87065122    87065122

Apologies for the slow response.

I have updated my required output, I was asked to include an extra record that was not a match to the other records but did not include this in my required output.

HABO's response was the closest to what was needed unfortunately on further testing with other sample data, duplicates were created and the logic broke down. Other Sample data would be :-

declare @customer_dist as Table (
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL );


INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('32006455', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07706212920',  '[email protected]'),
       ('35963960', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863324',  '[email protected]'),
       ('38627975', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863478',  '[email protected]'),
       ('46653041', 'Mary', 'WILSON',   '1983-09-20',   'BT62JA',   '07483888179',  '[email protected]'),
       ('48023677', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07483888179',  '[email protected]'),
       ('49560434', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('49861032', 'Mary', 'WILSON',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('53130969', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('33843283', 'Mary', 'Wilson',   '1983-09-20',   'BT148HU',  '07484863478',  '[email protected]'),
       ('38627975', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863478',  '[email protected]')

SELECT * FROM @customer_dist;

824

asked Aug 22 '18 14:08

PJD

1 Answers

I used to work in a very old school insurance company where they had similar issue with data.

My main attempt here is to narrow the result set with duplicates in it, kind of finding what ties dups together. Once you get this, the rest of the solutions comes really fast.

Logic is: join table to itself base on columns that share the same values (Fname, Lname, DOB) and occasionally same values (post_code, mobile, email) and more importantly ids should NOT match, this makes sure to exclude non-dup records and leave only dups.

After you have dups only, find MIN(id) and put it inside cte, join to the original table and there you are. Non-dup records don't need min-id because it's id is min-id.

;WITH DUPS AS
(
SELECT DISTINCT
    MIN(C1.ID) OVER(PARTITION BY C1.First_Name, 
C1.Last_Name, C1.DoB) AS minid,
    C1.id, C1.First_Name, C1.Last_Name, C1.DoB
FROM customer_dist c1 
INNER join customer_dist c2 
ON
c1.First_Name = c2.First_Name
AND c1.Last_Name = c2.Last_Name
AND c1.DoB = c2.DoB
AND (c1.post_code = c2.post_code OR c1.mobile = c2.mobile 
OR 
c1.Email = c2.Email)
AND C1.ID <> C2.ID
)

SELECT ISNULL(D.minid, C.ID) AS NID,
        C.*
FROM customer_dist C
LEFT JOIN DUPS D ON C.id = D.id

112

answered Oct 14 '22 19:10

Batman

Related questions
                            
                                In SQL is there a difference between count(*) and count(<fieldname>)
                            
                                SQL How to Update only first Row
                            
                                One 400GB table, One query - Need Tuning Ideas (SQL2005)
                            
                                SQL Server Check for IsNull and for Zero
                            
                                Cassandra CQL - NoSQL or SQL
                            
                                Simple way to prevent a Divide By Zero error in SQL
                            
                                How to generate date series to occupy absent dates in google BiqQuery?
                            
                                Simple SQL select in C#?
                            
                                BigQuery SQL for 28-day sliding window aggregate (without writing 28 lines of SQL)
                            
                                OR conflict between other conditions
                            
                                Handling Null in Greatest function in Oracle
                            
                                phpMyAdmin error: #1054 - Unknown column 'systeem_eisen' in 'order clause'
                            
                                Find and remove duplicate rows by two columns
                            
                                How to combine two tables in a query
                            
                                Exclude soft deleted items in self referential relationship SQLAlchemy
                            
                                Visual Studio 2010 database edition schema compare where target is dbproj
                            
                                Deleting file created from mysql
                            
                                Clarifying the difference between row-level lock in InnoDB engine and table-level lock in MyISAM engine in MySQL database
                            
                                Why would wrapping a TSQL query in an if statement increase its runtime significantly?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Merging duplicated records together with "Merge" syntax

Tags:

sql

sql-server

tsql

sql-server-2014

PJD

People also ask

1 Answers

Batman

Recent Activity

Donate For Us