I am working on a MYSQL database which has the following three columns: emails, name, surname.
What I need to do is deduplicate the emails where I know I can use a function such as this one (this query just to sort not delete):
select distinct emails, name, surname from emails;
or
select emails, name, surname from emails group by emails having count(*) >= 2;
However I also need to make sure that when there a duplicate email address is found that the one kept is the one that has a name and/or surname value.
For example:
|id | emails | name | surname |
|1 | [email protected] | bob | paulson |
|2 | [email protected] | | |
In this case I would like to keep the first result and delete the second.
I have been looking into using 'case' or 'if' statements but am not experienced with using those. I tried expanding the above functions with those statements but to no avail.
Could anyone point me in the right direction?
PS: The first column in the table is an auto-incremented id value, in case that helps
UPDATE 1: So far @Bohemian answer below is working great but fails in one case where there is a duplicate emails address where in one row it has a name but no surname and in the next row it has no name but has a surname. It will keep both records. All that needs to be edited is so that one of these two records gets deleted, no matter which.
UPDATE 2: @Bohemian's answer is great, but after more testing I've found that it has a fundamental flaw in that it works only when there is a duplicate email row where the name and surname fields have data (like the first entry in the table above). If there are duplicates of an email but none of the rows have both the name and surname fields filled in then all those rows will be ignored and not deduplicated.
The last step for this query would be to work out how to delete the duplicates that don't meet the current necessary conditions. If one row has just name and the other just surname, it really doesn't matter which gets deleted as the email is the important thing to keep.
You could use this DELETE query, which is generic and can be easily adapted to support more fields:
DELETE tablename.*
FROM
tablename LEFT JOIN (
SELECT MIN(id) min_id
FROM
tablename t INNER JOIN (
SELECT
emails, MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
tablename
GROUP BY
emails) m
ON t.emails=m.emails
AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL))=m.max_non_nulls
GROUP BY
t.emails) ids
ON tablename.id=ids.min_id
WHERE
ids.min_id IS NULL
Please see fiddle here.
This query returns the maximum number of non null fields, for every email:
SELECT
emails,
MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
tablename
GROUP BY
emails
I'm then joining this query with tablename, to get the minimum ID for every email that has the maximum number of non null fields:
SELECT MIN(id) min_id
FROM
tablename t INNER JOIN (
SELECT
emails, MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
tablename
GROUP BY
emails) m
ON t.emails=m.emails
AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL))=m.max_non_nulls
GROUP BY
t.emails
and then I'm deleting all rows that have an ID that is not returned by this query.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With