Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL: Eliminating duplicate rows without breaking a foreign key constraint

I've got a customer database filled with normalized addresses. There are duplicates.

Each user created their own record, and entered their own address. So we have a 1-to-1 relationship between the users and the addresses:

CREATE TABLE `users` (
  `UserID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `Name` VARCHAR(63),
  `Email` VARCHAR(63),
  `AddressID` INT UNSIGNED,
  PRIMARY KEY (`UserID`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `addresses` (
  `AddressID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `Duplicate` VARCHAR(1),
  `Address1` VARCHAR(63) DEFAULT NULL,
  `Address2` VARCHAR(63) DEFAULT NULL,
  `City` VARCHAR(63) DEFAULT NULL,
  `State` VARCHAR(2) DEFAULT NULL,
  `ZIP` VARCHAR(10) DEFAULT NULL,
  PRIMARY KEY (`AddressID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;  

And the data:

INSERT INTO `users` VALUES
    (1,  'Michael', '[email protected]', 1),
    (2,  'Steve',   '[email protected]',   2),
    (3,  'Judy',    '[email protected]',    3),
    (4,  'Kathy',   '[email protected]',   4),
    (5,  'Mark',    '[email protected]',    5),
    (6,  'Robert',  '[email protected]',  6),
    (7,  'Susan',   '[email protected]',   7),
    (8,  'Paul',    '[email protected]',    8),
    (9,  'Patrick', '[email protected]', 9),
    (10, 'Mary',    '[email protected]',    10),
    (11, 'James',   '[email protected]',   11),
    (12, 'Barbara', '[email protected]', 12),
    (13, 'Peter',   '[email protected]',   13);


INSERT INTO `addresses` VALUES
    (1,  '',  '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (2,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (3,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (4,  '',  '5678 Sycamore Lane', '',      'Upstate',     'NY', '50000'),
    (5,  '',  '1000 State Street',  'Apt C', 'Sunnydale',   'OH', '54321'),
    (6,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (7,  'Y', '1000 State Street',  'Apt C', 'Sunnydale',   'OH', '54321'),
    (8,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (9,  '',  '1000 State Street',  'Apt A', 'Sunnydale',   'OH', '54321'),
    (10, 'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (11, 'Y', '5678 Sycamore Lane', '',      'Upstate',     'NY', '50000'),
    (12, 'Y', '1000 Main Street',   'Apt A', 'Sunnydale',   'OH', '54321'),
    (13, '',  '9999 Valleyview',    '',      'Springfield', 'KS', '54321');

Oh yes, let me add in that foreign key relationship:

ALTER TABLE `users` ADD CONSTRAINT `AddressID` 
FOREIGN KEY `AddressID` (`AddressID`)
REFERENCES `addresses` (`AddressID`);

We had our address list scrubbed by a 3rd-party service that normalized the data and indicated where we had duplicates. This is where the Duplicate column came from. If there is a 'Y', it is a duplicate of another address. The primary address is NOT marked as a duplicate, as shown in the sample data.

I obviously want to remove all of the duplicate records, but there are user records that point to them. I need them to point to the version of the address that is NOT a duplicate.

So how can I update the AddressID in users to match the non-duplicate addresses?

The only way I can think to do it is by iterating through all of the data using a high-level language, but I'm fairly sure that MySQL has all the tools required to do something like this in a better way.

Here's what I've tried:

SELECT COUNT(*) as cnt, GROUP_CONCAT(AddressID ORDER BY AddressID) AS ids
FROM addresses
GROUP BY Address1, Address2, City, State, ZIP
HAVING cnt > 1;

+-----+--------------+
| cnt | ids          |
+-----+--------------+
|   2 | 5,7          |
|   6 | 1,2,3,6,8,10 |
|   2 | 4,11         |
+-----+--------------+
3 rows in set (0.00 sec)

From there, I could loop through each result row and do this:

UPDATE `users` SET `AddressID` = 1 WHERE `AddressID` IN (2,3,6,8,10);

But there has got to be a better MySQL-only way, shouldn't there?

When everything is said and done, the data SHOULD look like this:

SELECT * FROM `users`;
+--------+---------+-------------------+-----------+
| UserID | Name    | Email             | AddressID |
+--------+---------+-------------------+-----------+
|      1 | Michael | [email protected] |         1 |
|      2 | Steve   | [email protected]   |         1 |
|      3 | Judy    | [email protected]    |         1 |
|      4 | Kathy   | [email protected]   |         4 |
|      5 | Mark    | [email protected]    |         5 |
|      6 | Robert  | [email protected]  |         1 |
|      7 | Susan   | [email protected]   |         5 |
|      8 | Paul    | [email protected]    |         1 |
|      9 | Patrick | [email protected] |         9 |
|     10 | Mary    | [email protected]    |         1 |
|     11 | James   | [email protected]   |         4 |
|     12 | Barbara | [email protected] |         1 |
|     13 | Peter   | [email protected]   |        13 |
+--------+---------+-------------------+-----------+
13 rows in set (0.00 sec)

SELECT * FROM `addresses`;
+-----------+-----------+--------------------+----------+-------------+-------+-------+
| AddressID | Duplicate | Address1           | Address2 | City        | State | ZIP   |
+-----------+-----------+--------------------+----------+-------------+-------+-------+
|         1 |           | 1234 Main Street   |          | Springfield | KS    | 54321 |
|         4 |           | 5678 Sycamore Lane |          | Upstate     | NY    | 50000 |
|         5 |           | 1000 State Street  | Apt C    | Sunnydale   | OH    | 54321 |
|         9 |           | 1000 State Street  | Apt A    | Sunnydale   | OH    | 54321 |
|        13 |           | 9999 Valleyview    |          | Springfield | KS    | 54321 |
+-----------+-----------+--------------------+----------+-------------+-------+-------+
5 rows in set (0.00 sec)

Help?

like image 513
pbarney Avatar asked Nov 27 '13 02:11

pbarney


People also ask

How do I delete duplicate rows in MySQL?

DELETE FROM [table_name] WHERE row_number > 1; In our example dates table, the command would be: DELETE FROM dates WHERE row_number > 1; The output will tell you how many rows have been affected, that is, how many duplicate rows have been deleted.

Is duplicate allowed in foreign key?

Short answer: Yes, it can be NULL or duplicate. I want to explain why a foreign key might need to be null or might need to be unique or not unique. First remember a Foreign key simply requires that the value in that field must exist first in a different table (the parent table).

How can I delete duplicate rows where no unique row ID exists?

1) First identify the rows those satisfy the definition of duplicate and insert them into temp table, say #tableAll . 2) Select non-duplicate(single-rows) or distinct rows into temp table say #tableUnique. 3) Delete from source table joining #tableAll to delete the duplicates.


2 Answers

You have a many-to-one relationship between users and addresses (that is multiple users can map to the same address). This seems a bit odd to me, but I suppose it could be useful. Many-to-many would make more sense, i.e. a user can have multiple addresses, but the same address can be shared by multiple users. Generally, a single user has multiple addresses. Updating your schema may help, but I digress.

UPDATE users
-- We only care about users mapped to duplicate addresses
JOIN addresses dupe ON (users.AddressID = dupe.AddressID AND dupe.Duplicate='Y')
-- If your normalizer thingy worked right, these will all map to non-duplicates
JOIN addresses nondupe ON (dupe.Address1 = nondupe.Address1
    -- Compare to other columns if you want
    AND nondupe.Duplicate = '')
-- Set to the nondupe ID
SET users.AddressID = nondupe.AddressID;

http://sqlfiddle.com/#!2/5d303/1

like image 53
Explosion Pills Avatar answered Oct 15 '22 08:10

Explosion Pills


To select the results you want to see:

SELECT   a.UserID
        ,a.Name
        ,a.Email
        ,(
            SELECT  addressID 
            FROM    addresses c
            WHERE   c.Address1 = b.Address1
            AND     c.Address2 = b.Address2
            AND     c.City = b.City
            AND     c.State = b.State
            AND     c.ZIP = b.ZIP
            AND     DUPLICATE != 'Y'

        ) as AddressID
FROM    users a
JOIN    addresses b
ON      a.AddressID = b.AddressID

This will update the users table to the results shown in the query above.

UPDATE  users a
JOIN    addresses b
ON      a.AddressID = b.AddressID
SET     a.addressID  = 
        (
            SELECT  addressID 
            FROM    addresses c
            WHERE   c.Address1 = b.Address1
            AND     c.Address2 = b.Address2
            AND     c.City = b.City
            AND     c.State = b.State
            AND     c.ZIP = b.ZIP
            AND     Duplicate != 'Y'
        )
WHERE Duplicate = 'Y'

Note that with the sample data you provided, #12 Barbara's ID is null in the SELECT query since her address is marked as duplicate when in fact it is unique to the list provided. It does not match address 1 as indicated in the "how it should look" results.

Edit

In order to handle incorrect duplicate flags like #12 Barbara, or maybe other missed duplicates that have not marked as such, you can skip the duplicate flag column check and just use ORDER BY & LIMIT on the sub-query so that it will return the first lowest matching address ID, regardless of the duplicate flag:

UPDATE  users a
JOIN    addresses b
ON      a.AddressID = b.AddressID
SET     a.addressID = 
        (
            SELECT      addressID 
            FROM        addresses c
            WHERE       c.Address1 = b.Address1
            AND         c.Address2 = b.Address2
            AND         c.City = b.City
            AND         c.State = b.State
            AND         c.ZIP = b.ZIP
            ORDER BY    c.addressID ASC
            LIMIT       1
        )
like image 1
WebChemist Avatar answered Oct 15 '22 08:10

WebChemist