
MySQL: deduplicate and remove the duplicate row with the least data

Tags:

mysql

I am working on a MySQL table which has the following three columns: emails, name, surname.

What I need to do is deduplicate the emails. I know I can find the duplicates with a query such as one of these (they only select rows, they do not delete anything):

select distinct emails, name, surname from emails;

or

select emails, name, surname from emails group by emails having count(*) >= 2;

However, I also need to make sure that when a duplicate email address is found, the row that is kept is the one that has a name and/or surname value.

For example:

| id | emails            | name | surname |
| 1  | [email protected] | bob  | paulson |
| 2  | [email protected] |      |         |

In this case I would like to keep the first result and delete the second.

I have been looking into using 'case' or 'if' statements, but I am not experienced with them. I tried expanding the above queries with those statements, but to no avail.

Could anyone point me in the right direction?

PS: The first column in the table is an auto-incremented id value, in case that helps

UPDATE 1: So far @Bohemian's answer below is working great, but it fails in one case: when an email address is duplicated and one row has a name but no surname while the other has a surname but no name, it keeps both records. All that needs to change is that one of these two records gets deleted, no matter which.

UPDATE 2: @Bohemian's answer is great, but after more testing I've found that it has a fundamental flaw in that it works only when there is a duplicate email row where the name and surname fields have data (like the first entry in the table above). If there are duplicates of an email but none of the rows have both the name and surname fields filled in then all those rows will be ignored and not deduplicated.

The last step for this query would be to work out how to delete the duplicates that don't meet the current necessary conditions. If one row has just name and the other just surname, it really doesn't matter which gets deleted as the email is the important thing to keep.
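The three situations described above can be reproduced with a small test table. This is only a sketch: the table name `tablename` is chosen to match the answer's queries, the column types are assumptions, the `example.com` addresses and the names `ann` and `lee` are invented, and blank fields are assumed to be stored as NULL rather than as empty strings.

```sql
-- Hypothetical test data; column types and addresses are assumptions.
CREATE TABLE tablename (
  id      INT AUTO_INCREMENT PRIMARY KEY,
  emails  VARCHAR(255) NOT NULL,
  name    VARCHAR(100) NULL,
  surname VARCHAR(100) NULL
);

INSERT INTO tablename (emails, name, surname) VALUES
  ('a@example.com', 'bob', 'paulson'),  -- complete row: the one to keep
  ('a@example.com', NULL,  NULL),       -- empty duplicate: the one to delete
  ('b@example.com', 'ann', NULL),       -- name only
  ('b@example.com', NULL,  'lee'),      -- surname only: either row may be kept
  ('c@example.com', NULL,  NULL),       -- no duplicate has any data:
  ('c@example.com', NULL,  NULL);       -- still only one row should remain
```

A correct deduplication should leave exactly one row per address in this table.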

psychedelus asked Aug 01 '13 09:08


1 Answer

You could use this DELETE query, which is generic and can be easily adapted to support more fields:

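-- Delete every row that is not the chosen "best" row for its email
DELETE tablename.*
FROM
  tablename LEFT JOIN (
    -- ids: one row per email, the lowest id among its most complete rows
    SELECT MIN(id) min_id
    FROM
      tablename t INNER JOIN (
        -- m: highest count of non-NULL fields (0, 1 or 2) per email
        SELECT
          emails, MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
        FROM
          tablename
        GROUP BY
          emails) m
      ON t.emails=m.emails
         AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL))=m.max_non_nulls
    GROUP BY
      t.emails) ids
  ON tablename.id=ids.min_id
WHERE
  -- no match in ids means the row is not the one to keep
  ids.min_id IS NULL;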

Please see fiddle here.
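Before running a destructive statement like this, it may help to preview what it would remove. The sketch below assumes the same table and column names as the DELETE above and simply turns it into a SELECT with identical joins. The second query is a check that could be run afterwards.

```sql
-- Preview: same joins as the DELETE, but only listing the rows
-- that would be removed.
SELECT tablename.*
FROM
  tablename LEFT JOIN (
    SELECT MIN(id) min_id
    FROM
      tablename t INNER JOIN (
        SELECT
          emails, MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
        FROM
          tablename
        GROUP BY
          emails) m
      ON t.emails=m.emails
         AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL))=m.max_non_nulls
    GROUP BY
      t.emails) ids
  ON tablename.id=ids.min_id
WHERE
  ids.min_id IS NULL;

-- After the DELETE has run, this should return no rows:
-- every email should appear at most once.
SELECT emails, COUNT(*) AS copies
FROM tablename
GROUP BY emails
HAVING COUNT(*) > 1;
```

One caveat worth checking in the preview: a row whose emails value is NULL never satisfies `t.emails=m.emails`, so such rows would all be listed and deleted. If the column can contain NULLs, they probably need to be handled separately.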

This query returns, for every email, the maximum number of non-NULL fields found in any of its rows:

SELECT
  emails,
  MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
  tablename
GROUP BY
  emails

I'm then joining this query with tablename, to get the minimum ID for every email among the rows that have the maximum number of non-NULL fields:

SELECT MIN(id) min_id
FROM
  tablename t INNER JOIN (
    SELECT
      emails, MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
    FROM
      tablename
    GROUP BY
      emails) m
  ON t.emails=m.emails
     AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL))=m.max_non_nulls
GROUP BY
  t.emails

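and then I'm deleting all rows that have an ID that is not returned by this query: the LEFT JOIN leaves ids.min_id NULL for exactly those rows, and the WHERE clause selects them. Since exactly one ID is returned per email, rows where only the name or only the surname is filled in are deduplicated too, as are rows with neither.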
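The queries above count a field as filled whenever it is not NULL. The question does not say whether the blank cells are NULL or empty strings, so the following variant is only a sketch for the second case: it assumes blanks are stored as '' and wraps each column in NULLIF, which returns NULL when the column equals the empty string.

```sql
-- Assumption: blank values are stored as '' rather than NULL.
-- NULLIF(x, '') is NULL when x = '', so empty strings count as missing.
DELETE tablename.*
FROM
  tablename LEFT JOIN (
    SELECT MIN(id) min_id
    FROM
      tablename t INNER JOIN (
        SELECT
          emails,
          MAX((NULLIF(name, '') IS NOT NULL)
              + (NULLIF(surname, '') IS NOT NULL)) max_non_nulls
        FROM
          tablename
        GROUP BY
          emails) m
      ON t.emails=m.emails
         AND ((NULLIF(t.name, '') IS NOT NULL)
              + (NULLIF(t.surname, '') IS NOT NULL))=m.max_non_nulls
    GROUP BY
      t.emails) ids
  ON tablename.id=ids.min_id
WHERE
  ids.min_id IS NULL;
```

Because NULLIF passes NULL through unchanged, this form treats NULL and '' alike, so it should also be safe when the table mixes both.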

fthiella answered Oct 14 '22 08:10