So I'm trying to clean some phone records in a database table.
I've found out how to find exact matches in 2 fields using:
/* DUPLICATE first & last names */
SELECT
`First Name`,
`Last Name`,
COUNT(*) c
FROM phone.contacts
GROUP BY
`Last Name`,
`First Name`
HAVING c > 1;
Wow, great.
I want to expand it further to look at numerous fields to see if a phone number in 1 of 3 phone fields is a duplicate.
So I want to check 3 fields (general mobile
, general phone
, business phone
).
1.to see that they are not empty ('') 2.to see if the data (number) in any of them appears in the other 2 phone fields anywhere in the table.
So pushing my limited SQL past its limit I came up with the following which seems to return records with 3 empty phone fields & also records that don't have duplicate phone numbers.
/* DUPLICATE general & business phone nos */
SELECT
id,
`first name`,
`last name`,
`general mobile`,
`general phone`,
`general email`,
`business phone`,
COUNT(CASE WHEN `general mobile` <> '' THEN 1 ELSE NULL END) as gen_mob,
COUNT(CASE WHEN `general phone` <> '' THEN 1 ELSE NULL END) as gen_phone,
COUNT(CASE WHEN `business phone` <> '' THEN 1 ELSE NULL END) as bus_phone
FROM phone.contacts
GROUP BY
`general mobile`,
`general phone`,
`business phone`
HAVING gen_mob > 1 OR gen_phone > 1 OR bus_phone > 1;
Clearly my logic is flawed & I wondered if someone could point me in the right direction/take pity etc...
Many thanks
The Remove Duplicate rows feature in Power Query also works across multiple columns. You just need to select the columns that need to be distinct. For example, choosing "PO" and "Decision" or choosing, "PO", "Decision" and "Date". Please note that Power Query will keep the First Distinct result instead of Group By.
The first thing you should do shoot the person that named your columns with spaces in them.
Now then, try this:
SELECT DISTINCT
c.id,
c.`first name`,
c.`last name`,
c.`general mobile`,
c.`general phone`,
c.`business phone`
from contacts_test c
join contacts_test c2
on (c.`general mobile`!= '' and c.`general mobile` in (c2.`general phone`, c2.`business phone`))
or (c.`general phone` != '' and c.`general phone` in (c2.`general mobile`, c2.`business phone`))
or (c.`business phone`!= '' and c.`business phone` in (c2.`general mobile`, c2.`general phone`))
See a live demo of this query in SQLFiddle.
Note the extra check for phone != ''
, which is required because the phone numbers are not nullable, so their "unknown" value is blank. Without this check, false matches are returned because of course blank equals blank.
The DISTINCT
keyword was added in case there are multiple other rows that match, which would result in a nxn result set.
In my experience, when cleaning up data, it's much better to have a comprehending view of the data, and a simple way to manage it, than to have a big and bulky query that does all the analysis at once.
You can also, (more-or-less) renormalize the database, using something like:
Create view VContactsWithPhones
as
Select id,
`Last Name` as LastName,
`First Name` as FirstName,
`General Mobile` as Phone,
'General Mobile' as PhoneType
From phone.contacts c
UNION
Select id,
`Last Name`,
`First Name`,
`General Phone`,
'General Phone'
From phone.contacts c
UNION
Select id,
`Last Name`,
`First Name`,
`Business Phone`,
'Business Phone'
From phone.contacts c
This will generate a view with triple the rows of the original table, but with a Phone
column, that can be of one of three types.
You can than easily select from that view:
//empty phones
SELECT *
FROM VContactsWithPhones
Where Phone is null or Phone = ''
//duplicate phones
Select Phone, Count(*)
from VContactsWithPhones
where (Phone is not null and Phone <> '') -- exclude empty values
group by Phone
having count(*) > 1
//duplicate phones belonging to the same ID (double entries)
Select Phone, ID, Count(*)
from VContactsWithPhones
where (Phone is not null and Phone <> '') -- exclude empty values
group by Phone, ID
having count(*) > 1
//duplicate phones belonging to the different ID (duplicate entries)
Select v1.Phone, v1.ID, v1.PhoneType, v2.ID, v2.PhoneType
from VContactsWithPhones v1
inner join VContactsWithPhones v2
on v1.Phone=v2.Phone and v1.ID=v2.ID
where v1.Phone is not null and v1.Phone <> ''
etc, etc...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With