Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging duplicated records together with "Merge" syntax

I am using SQL Server 2014. I am currently trying to combine millions of personnel application records in to a single personnel record.

The records contain the following columns:

ID, First_Name, Last_Name, DOB, Post_Code, Mobile, Email

A person can enter their details numerous times but due to fat fingers or fraud they can sometimes put in, incorrect details.

In my example Christopher has filled his details in 5 times, First_Name, Last_Name, DOB are always correct, Post_Code, Mobile and Email contain various connotations.

What I want to do is take the min(id) associated with this group in this case 84015283 and put it in to a new table, this will be the primary key and then you will see the other id's that are associated with it.

Examples

NID       CID
------------------
84015283  84015283
84015283  84069198
84015283  84070263
84015283  84369603
84015283  85061159

Where it gets a little complicated is, where 2 different people can have the same First_Name, Last_Name and DOB, at least one of the other fields must match "post_code, mobile or email" as per my example to another record within the group.

Though first_name, last_name, DoB match between ID's 84015283, 84069198, 84070263. 84015283, 84069198 are identical so they would match without an issue, 84070263 matches on the postcode, 84369603 matches on the mobile to a previous record and 85061159 matches on a previous mobile/email but not post_code.

If putting the NID within the original dataset is easier I can go with this rather than putting it all in a separate table.

After some googling and trying to get my head around this, I believe that using "Merge" might be a good way to achieve what I am after but I am concerned it will take a very long time due to the number of records involved.

Also going forward any routine would have to be run on subsequent new records.

I have listed the code for the example if anyone can help

DROP TABLE customer_dist

CREATE TABLE [dbo].customer_dist
(
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL,
)

INSERT INTO customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
       ('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
       ('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', '[email protected]'),
       ('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', '[email protected]'),
       ('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', '[email protected]'),
       ('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', '[email protected]')

SELECT * FROM customer_dist

Below is the expected results, sorry I should of made it clearer what I wanted at the end.

Output Table Results

    NID         id          First_Name  Last_Name   DoB         post_code   mobile          Email
    84015283    84015283    Christopher Higg            1/13/1956   CH2 3AZ         7089559829  [email protected]
    84015283    84069198    Christopher Higg            1/13/1956   CH2 3AZ         7089559829  [email protected]
    84015283    84070263    Christopher Higg            1/13/1956   CH2 3AZ         7089559822  [email protected]
    84015283    84369603    Christopher Higg            1/13/1956   CH2 3ZA         7089559829  [email protected]
    84015283    85061159    CHRISTOPHER Higg            1/13/1956   CH2 3RA         7089559829  [email protected]
    78065122    87065122    Matthew Davis               05/10/1978  CH5 1TS

7077084692 [email protected]

OR                          

NID         id
84015283    84015283
84015283    84069198
84015283    84070263
84015283    84369603
84015283    85061159
87065122    87065122

Apologies for the slow response.

I have updated my required output, I was asked to include an extra record that was not a match to the other records but did not include this in my required output.

HABO's response was the closest to what was needed unfortunately on further testing with other sample data, duplicates were created and the logic broke down. Other Sample data would be :-

declare @customer_dist as Table (
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL );


INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('32006455', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07706212920',  '[email protected]'),
       ('35963960', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863324',  '[email protected]'),
       ('38627975', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863478',  '[email protected]'),
       ('46653041', 'Mary', 'WILSON',   '1983-09-20',   'BT62JA',   '07483888179',  '[email protected]'),
       ('48023677', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07483888179',  '[email protected]'),
       ('49560434', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('49861032', 'Mary', 'WILSON',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('53130969', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('33843283', 'Mary', 'Wilson',   '1983-09-20',   'BT148HU',  '07484863478',  '[email protected]'),
       ('38627975', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863478',  '[email protected]')

SELECT * FROM @customer_dist;
like image 824
PJD Avatar asked Aug 22 '18 14:08

PJD


People also ask

How do I MERGE duplicate records?

Select the duplicate records, and then click Merge. In the Merge Records dialog box, select the master record (the one you want to keep), and then select any fields in the new record that you want to merge into the master record. Data in these fields may override the existing data in the master record. Click OK.

How do I MERGE duplicate cells?

Combine Duplicate Rows Combine the data and get a total sales amount for every product in a separate range, next to the original one. Select the cell where you want to get a new data range (e.g., E1), and in the Ribbon, go to Data > Consolidate.

How do you group duplicate values together in Excel?

➤ Click on Sort to Expand it and select Sort Largest to Smallest. It will rearrange the rows of your dataset and will put the duplicates together. So, the duplicates will be grouped together.


1 Answers

I used to work in a very old school insurance company where they had similar issue with data.

My main attempt here is to narrow the result set with duplicates in it, kind of finding what ties dups together. Once you get this, the rest of the solutions comes really fast.

Logic is: join table to itself base on columns that share the same values (Fname, Lname, DOB) and occasionally same values (post_code, mobile, email) and more importantly ids should NOT match, this makes sure to exclude non-dup records and leave only dups.

After you have dups only, find MIN(id) and put it inside cte, join to the original table and there you are. Non-dup records don't need min-id because it's id is min-id.

;WITH DUPS AS
(
SELECT DISTINCT
    MIN(C1.ID) OVER(PARTITION BY C1.First_Name, 
C1.Last_Name, C1.DoB) AS minid,
    C1.id, C1.First_Name, C1.Last_Name, C1.DoB
FROM customer_dist c1 
INNER join customer_dist c2 
ON
c1.First_Name = c2.First_Name
AND c1.Last_Name = c2.Last_Name
AND c1.DoB = c2.DoB
AND (c1.post_code = c2.post_code OR c1.mobile = c2.mobile 
OR 
c1.Email = c2.Email)
AND C1.ID <> C2.ID
)

SELECT ISNULL(D.minid, C.ID) AS NID,
        C.*
FROM customer_dist C
LEFT JOIN DUPS D ON C.id = D.id
like image 112
Batman Avatar answered Oct 14 '22 19:10

Batman