Deduping database records comparing values in numerous fields

So I'm trying to clean some phone records in a database table.

I've found out how to find exact matches in 2 fields using:

/* DUPLICATE first & last names */

SELECT 
    `First Name`, 
    `Last Name`, 
     COUNT(*) c 
FROM phone.contacts  
GROUP BY 
    `Last Name`, 
    `First Name` 
HAVING c > 1;

Wow, great.

I want to expand it further to look at numerous fields to see if a phone number in 1 of 3 phone fields is a duplicate.

So I want to check 3 fields (general mobile, general phone, business phone).

1.to see that they are not empty ('') 2.to see if the data (number) in any of them appears in the other 2 phone fields anywhere in the table.

So pushing my limited SQL past its limit I came up with the following which seems to return records with 3 empty phone fields & also records that don't have duplicate phone numbers.

/* DUPLICATE general & business phone nos */

SELECT 
    id, 
   `first name`, 
   `last name`, 
   `general mobile`, 
   `general phone`, 
   `general email`, 
   `business phone`, 
    COUNT(CASE WHEN `general mobile` <> '' THEN 1 ELSE NULL END) as gen_mob, 
    COUNT(CASE WHEN `general phone` <> '' THEN 1 ELSE NULL END) as gen_phone,
    COUNT(CASE WHEN `business phone` <> '' THEN 1 ELSE NULL END) as bus_phone 
FROM phone.contacts 
GROUP BY 
   `general mobile`, 
   `general phone`, 
   `business phone` 
HAVING gen_mob > 1 OR gen_phone > 1 OR bus_phone > 1;

Clearly my logic is flawed & I wondered if someone could point me in the right direction/take pity etc...

Many thanks

How do you remove duplicates based on multiple columns in power query?

The Remove Duplicate rows feature in Power Query also works across multiple columns. You just need to select the columns that need to be distinct. For example, choosing "PO" and "Decision" or choosing, "PO", "Decision" and "Date". Please note that Power Query will keep the First Distinct result instead of Group By.

The first thing you should do shoot the person that named your columns with spaces in them.

Now then, try this:

SELECT DISTINCT
   c.id, 
   c.`first name`, 
   c.`last name`, 
   c.`general mobile`, 
   c.`general phone`, 
   c.`business phone`
from contacts_test c
join contacts_test c2
    on (c.`general mobile`!= '' and c.`general mobile` in (c2.`general phone`, c2.`business phone`))
    or (c.`general phone` != '' and c.`general phone` in (c2.`general mobile`, c2.`business phone`))
    or (c.`business phone`!= '' and c.`business phone` in (c2.`general mobile`, c2.`general phone`))

See a live demo of this query in SQLFiddle.

Note the extra check for phone != '', which is required because the phone numbers are not nullable, so their "unknown" value is blank. Without this check, false matches are returned because of course blank equals blank.

The DISTINCT keyword was added in case there are multiple other rows that match, which would result in a nxn result set.

In my experience, when cleaning up data, it's much better to have a comprehending view of the data, and a simple way to manage it, than to have a big and bulky query that does all the analysis at once.

You can also, (more-or-less) renormalize the database, using something like:

Create view VContactsWithPhones
as
Select id, 
       `Last Name` as LastName, 
       `First Name` as FirstName,
       `General Mobile` as Phone,
       'General Mobile' as PhoneType
From phone.contacts c
UNION
Select id, 
       `Last Name`, 
       `First Name`,
       `General Phone`,
       'General Phone'
From phone.contacts c
UNION
Select id, 
       `Last Name`, 
       `First Name`,
       `Business Phone`,
       'Business Phone'
From phone.contacts c

This will generate a view with triple the rows of the original table, but with a Phone column, that can be of one of three types.

You can than easily select from that view:

//empty phones
SELECT * 
FROM VContactsWithPhones 
Where Phone is null or Phone = ''

//duplicate phones
Select Phone, Count(*)
from VContactsWithPhones 
where (Phone is not null and Phone <> '')  -- exclude empty values
group by Phone
having count(*) > 1

//duplicate phones belonging to the same ID (double entries)
Select Phone, ID, Count(*)
from VContactsWithPhones 
where (Phone is not null and Phone <> '')  -- exclude empty values
group by Phone, ID
having count(*) > 1

//duplicate phones belonging to the different ID (duplicate entries)
Select v1.Phone, v1.ID, v1.PhoneType, v2.ID, v2.PhoneType
from VContactsWithPhones v1
   inner join VContactsWithPhones v2 
     on v1.Phone=v2.Phone and v1.ID=v2.ID
where v1.Phone is not null and v1.Phone <> ''

etc, etc...

Deduping database records comparing values in numerous fields

Tags:

sql

mysql

deduplication

Still_Learning

People also ask

2 Answers

Bohemian

SWeko

Recent Activity

Donate For Us

Deduping database records comparing values in numerous fields

Tags:

sql

mysql

deduplication

Still_Learning

People also ask

2 Answers

Bohemian

SWeko

Related questions

Recent Activity

Donate For Us