I am using SQL Server 2014. I am currently trying to combine millions of personnel application records in to a single personnel record.
The records contain the following columns:
ID, First_Name, Last_Name, DOB, Post_Code, Mobile, Email
A person can enter their details numerous times but due to fat fingers or fraud they can sometimes put in, incorrect details.
In my example Christopher has filled his details in 5 times, First_Name
, Last_Name
, DOB
are always correct, Post_Code
, Mobile
and Email
contain various connotations.
What I want to do is take the min(id) associated with this group in this case 84015283 and put it in to a new table, this will be the primary key and then you will see the other id's that are associated with it.
Examples
NID CID
------------------
84015283 84015283
84015283 84069198
84015283 84070263
84015283 84369603
84015283 85061159
Where it gets a little complicated is, where 2 different people can have the same First_Name
, Last_Name
and DOB
, at least one of the other fields must match "post_code
, mobile
or email
" as per my example to another record within the group.
Though first_name
, last_name
, DoB
match between ID's 84015283, 84069198, 84070263. 84015283, 84069198 are identical so they would match without an issue, 84070263 matches on the postcode, 84369603 matches on the mobile to a previous record and 85061159 matches on a previous mobile/email but not post_code.
If putting the NID within the original dataset is easier I can go with this rather than putting it all in a separate table.
After some googling and trying to get my head around this, I believe that using "Merge" might be a good way to achieve what I am after but I am concerned it will take a very long time due to the number of records involved.
Also going forward any routine would have to be run on subsequent new records.
I have listed the code for the example if anyone can help
DROP TABLE customer_dist
CREATE TABLE [dbo].customer_dist
(
[id] [int] NOT NULL,
[First_Name] [varchar](50) NULL,
[Last_Name] [varchar](50) NULL,
[DoB] [date] NULL,
[post_code] [varchar](50) NULL,
[mobile] [varchar](50) NULL,
[Email] [varchar](100) NULL,
)
INSERT INTO customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', '[email protected]'),
('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', '[email protected]'),
('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', '[email protected]'),
('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', '[email protected]')
SELECT * FROM customer_dist
Below is the expected results, sorry I should of made it clearer what I wanted at the end.
Output Table Results
NID id First_Name Last_Name DoB post_code mobile Email
84015283 84015283 Christopher Higg 1/13/1956 CH2 3AZ 7089559829 [email protected]
84015283 84069198 Christopher Higg 1/13/1956 CH2 3AZ 7089559829 [email protected]
84015283 84070263 Christopher Higg 1/13/1956 CH2 3AZ 7089559822 [email protected]
84015283 84369603 Christopher Higg 1/13/1956 CH2 3ZA 7089559829 [email protected]
84015283 85061159 CHRISTOPHER Higg 1/13/1956 CH2 3RA 7089559829 [email protected]
78065122 87065122 Matthew Davis 05/10/1978 CH5 1TS
7077084692 [email protected]
OR
NID id
84015283 84015283
84015283 84069198
84015283 84070263
84015283 84369603
84015283 85061159
87065122 87065122
Apologies for the slow response.
I have updated my required output, I was asked to include an extra record that was not a match to the other records but did not include this in my required output.
HABO's response was the closest to what was needed unfortunately on further testing with other sample data, duplicates were created and the logic broke down. Other Sample data would be :-
declare @customer_dist as Table (
[id] [int] NOT NULL,
[First_Name] [varchar](50) NULL,
[Last_Name] [varchar](50) NULL,
[DoB] [date] NULL,
[post_code] [varchar](50) NULL,
[mobile] [varchar](50) NULL,
[Email] [varchar](100) NULL );
INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('32006455', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07706212920', '[email protected]'),
('35963960', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863324', '[email protected]'),
('38627975', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863478', '[email protected]'),
('46653041', 'Mary', 'WILSON', '1983-09-20', 'BT62JA', '07483888179', '[email protected]'),
('48023677', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07483888179', '[email protected]'),
('49560434', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07849727199', '[email protected]'),
('49861032', 'Mary', 'WILSON', '1983-09-20', 'BT62JA', '07849727199', '[email protected]'),
('53130969', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07849727199', '[email protected]'),
('33843283', 'Mary', 'Wilson', '1983-09-20', 'BT148HU', '07484863478', '[email protected]'),
('38627975', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863478', '[email protected]')
SELECT * FROM @customer_dist;
Select the duplicate records, and then click Merge. In the Merge Records dialog box, select the master record (the one you want to keep), and then select any fields in the new record that you want to merge into the master record. Data in these fields may override the existing data in the master record. Click OK.
Combine Duplicate Rows Combine the data and get a total sales amount for every product in a separate range, next to the original one. Select the cell where you want to get a new data range (e.g., E1), and in the Ribbon, go to Data > Consolidate.
➤ Click on Sort to Expand it and select Sort Largest to Smallest. It will rearrange the rows of your dataset and will put the duplicates together. So, the duplicates will be grouped together.
I used to work in a very old school insurance company where they had similar issue with data.
My main attempt here is to narrow the result set with duplicates in it, kind of finding what ties dups together. Once you get this, the rest of the solutions comes really fast.
Logic is: join table to itself base on columns that share the same values (Fname, Lname, DOB) and occasionally same values (post_code, mobile, email) and more importantly ids should NOT match, this makes sure to exclude non-dup records and leave only dups.
After you have dups only, find MIN(id) and put it inside cte, join to the original table and there you are. Non-dup records don't need min-id because it's id is min-id.
;WITH DUPS AS
(
SELECT DISTINCT
MIN(C1.ID) OVER(PARTITION BY C1.First_Name,
C1.Last_Name, C1.DoB) AS minid,
C1.id, C1.First_Name, C1.Last_Name, C1.DoB
FROM customer_dist c1
INNER join customer_dist c2
ON
c1.First_Name = c2.First_Name
AND c1.Last_Name = c2.Last_Name
AND c1.DoB = c2.DoB
AND (c1.post_code = c2.post_code OR c1.mobile = c2.mobile
OR
c1.Email = c2.Email)
AND C1.ID <> C2.ID
)
SELECT ISNULL(D.minid, C.ID) AS NID,
C.*
FROM customer_dist C
LEFT JOIN DUPS D ON C.id = D.id
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With