MySQL: Eliminating duplicate rows without breaking a foreign key constraint

I've got a customer database filled with normalized addresses. There are duplicates.

Each user created their own record, and entered their own address. So we have a 1-to-1 relationship between the users and the addresses:

CREATE TABLE `users` (
  `UserID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `Name` VARCHAR(63),
  `Email` VARCHAR(63),
  `AddressID` INT UNSIGNED,
  PRIMARY KEY (`UserID`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `addresses` (
  `AddressID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `Duplicate` VARCHAR(1),
  `Address1` VARCHAR(63) DEFAULT NULL,
  `Address2` VARCHAR(63) DEFAULT NULL,
  `City` VARCHAR(63) DEFAULT NULL,
  `State` VARCHAR(2) DEFAULT NULL,
  `ZIP` VARCHAR(10) DEFAULT NULL,
  PRIMARY KEY (`AddressID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

And the data:

INSERT INTO `users` VALUES
    (1,  'Michael', 'michael@email.com', 1),
    (2,  'Steve',   'steve@email.com',   2),
    (3,  'Judy',    'judy@email.com',    3),
    (4,  'Kathy',   'kathy@email.com',   4),
    (5,  'Mark',    'mark@email.com',    5),
    (6,  'Robert',  'robert@email.com',  6),
    (7,  'Susan',   'susan@email.com',   7),
    (8,  'Paul',    'paul@email.com',    8),
    (9,  'Patrick', 'patrick@email.com', 9),
    (10, 'Mary',    'mary@email.com',    10),
    (11, 'James',   'james@email.com',   11),
    (12, 'Barbara', 'barbara@email.com', 12),
    (13, 'Peter',   'peter@email.com',   13);


INSERT INTO `addresses` VALUES
    (1,  '',  '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (2,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (3,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (4,  '',  '5678 Sycamore Lane', '',      'Upstate',     'NY', '50000'),
    (5,  '',  '1000 State Street',  'Apt C', 'Sunnydale',   'OH', '54321'),
    (6,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (7,  'Y', '1000 State Street',  'Apt C', 'Sunnydale',   'OH', '54321'),
    (8,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (9,  '',  '1000 State Street',  'Apt A', 'Sunnydale',   'OH', '54321'),
    (10, 'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (11, 'Y', '5678 Sycamore Lane', '',      'Upstate',     'NY', '50000'),
    (12, 'Y', '1000 Main Street',   'Apt A', 'Sunnydale',   'OH', '54321'),
    (13, '',  '9999 Valleyview',    '',      'Springfield', 'KS', '54321');

Oh yes, let me add in that foreign key relationship:

ALTER TABLE `users` ADD CONSTRAINT `AddressID` 
FOREIGN KEY `AddressID` (`AddressID`)
REFERENCES `addresses` (`AddressID`);

We had our address list scrubbed by a 3rd-party service that normalized the data and indicated where we had duplicates. This is where the Duplicate column came from. If there is a 'Y', it is a duplicate of another address. The primary address is NOT marked as a duplicate, as shown in the sample data.

I obviously want to remove all of the duplicate records, but there are user records that point to them. I need them to point to the version of the address that is NOT a duplicate.

So how can I update the AddressID in users to match the non-duplicate addresses?

The only way I can think to do it is by iterating through all of the data using a high-level language, but I'm fairly sure that MySQL has all the tools required to do something like this in a better way.

Here's what I've tried:

SELECT COUNT(*) as cnt, GROUP_CONCAT(AddressID ORDER BY AddressID) AS ids
FROM addresses
GROUP BY Address1, Address2, City, State, ZIP
HAVING cnt > 1;

+-----+--------------+
| cnt | ids          |
+-----+--------------+
|   2 | 5,7          |
|   6 | 1,2,3,6,8,10 |
|   2 | 4,11         |
+-----+--------------+
3 rows in set (0.00 sec)

From there, I could loop through each result row and do this:

UPDATE `users` SET `AddressID` = 1 WHERE `AddressID` IN (2,3,6,8,10);

But there has got to be a better MySQL-only way, hasn't there?

When everything is said and done, the data SHOULD look like this:

SELECT * FROM `users`;
+--------+---------+-------------------+-----------+
| UserID | Name    | Email             | AddressID |
+--------+---------+-------------------+-----------+
|      1 | Michael | michael@email.com |         1 |
|      2 | Steve   | steve@email.com   |         1 |
|      3 | Judy    | judy@email.com    |         1 |
|      4 | Kathy   | kathy@email.com   |         4 |
|      5 | Mark    | mark@email.com    |         5 |
|      6 | Robert  | robert@email.com  |         1 |
|      7 | Susan   | susan@email.com   |         5 |
|      8 | Paul    | paul@email.com    |         1 |
|      9 | Patrick | patrick@email.com |         9 |
|     10 | Mary    | mary@email.com    |         1 |
|     11 | James   | james@email.com   |         4 |
|     12 | Barbara | barbara@email.com |         1 |
|     13 | Peter   | peter@email.com   |        13 |
+--------+---------+-------------------+-----------+
13 rows in set (0.00 sec)

SELECT * FROM `addresses`;
+-----------+-----------+--------------------+----------+-------------+-------+-------+
| AddressID | Duplicate | Address1           | Address2 | City        | State | ZIP   |
+-----------+-----------+--------------------+----------+-------------+-------+-------+
|         1 |           | 1234 Main Street   |          | Springfield | KS    | 54321 |
|         4 |           | 5678 Sycamore Lane |          | Upstate     | NY    | 50000 |
|         5 |           | 1000 State Street  | Apt C    | Sunnydale   | OH    | 54321 |
|         9 |           | 1000 State Street  | Apt A    | Sunnydale   | OH    | 54321 |
|        13 |           | 9999 Valleyview    |          | Springfield | KS    | 54321 |
+-----------+-----------+--------------------+----------+-------------+-------+-------+
5 rows in set (0.00 sec)

Help?

asked Nov 27 '13 02:11

pbarney

2 Answers

You have a many-to-one relationship between users and addresses (that is, multiple users can map to the same address). This seems a bit odd to me, but I suppose it could be useful. Many-to-many would make more sense: a single user often has more than one address, and the same address can be shared by multiple users. Updating your schema may help, but I digress.
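
For what it's worth, a many-to-many layout is usually built with a junction table. Here is a rough sketch, assuming the `users` and `addresses` tables from the question; the table name `user_addresses` and the constraint names are made up for illustration and are not part of the original schema:

```sql
-- Hypothetical junction table: one row per (user, address) pairing.
-- `user_addresses`, `fk_ua_user` and `fk_ua_address` are illustrative names.
CREATE TABLE `user_addresses` (
  `UserID` INT UNSIGNED NOT NULL,
  `AddressID` INT UNSIGNED NOT NULL,
  -- The composite key stops the same pairing from being stored twice
  PRIMARY KEY (`UserID`, `AddressID`),
  CONSTRAINT `fk_ua_user` FOREIGN KEY (`UserID`)
    REFERENCES `users` (`UserID`),
  CONSTRAINT `fk_ua_address` FOREIGN KEY (`AddressID`)
    REFERENCES `addresses` (`AddressID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```

With a table like this, `users.AddressID` would eventually go away. For the question as asked, though, the existing schema is enough.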

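UPDATE users
-- We only care about users mapped to duplicate addresses
JOIN addresses dupe ON (users.AddressID = dupe.AddressID AND dupe.Duplicate = 'Y')
-- If your normalizer thingy worked right, these will all map to non-duplicates
JOIN addresses nondupe ON (dupe.Address1 = nondupe.Address1
    -- Compare every address column: Address1 alone is ambiguous in the
    -- sample data ('1000 State Street' exists as both Apt A and Apt C).
    -- Use <=> instead of = if any of these columns can hold NULL.
    AND dupe.Address2 = nondupe.Address2
    AND dupe.City = nondupe.City
    AND dupe.State = nondupe.State
    AND dupe.ZIP = nondupe.ZIP
    AND nondupe.Duplicate = '')
-- Set to the nondupe ID
SET users.AddressID = nondupe.AddressID;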

http://sqlfiddle.com/#!2/5d303/1
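
Once the users have been repointed, the flagged rows still have to be deleted, and the foreign key will reject the delete if any user still references one of them. A minimal sketch of that last step, assuming the `Duplicate` flag is the only thing marking rows for removal:

```sql
-- Sanity check: any rows returned here are users still pointing at a
-- flagged duplicate. The DELETE below would fail on the foreign key
-- while such rows exist.
SELECT u.UserID, u.Name, u.AddressID
FROM users u
JOIN addresses a ON a.AddressID = u.AddressID
WHERE a.Duplicate = 'Y';

-- Only once the check comes back empty: remove the flagged rows.
DELETE FROM addresses WHERE Duplicate = 'Y';
```

With the sample data I would expect the check to return Barbara (UserID 12), because address 12 is flagged but has no non-duplicate twin to map to. That row needs a manual decision before the delete can go through.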

answered Oct 15 '22 08:10

Explosion Pills

To select the results you want to see:

SELECT   a.UserID
        ,a.Name
        ,a.Email
        ,(
            SELECT  c.AddressID
            FROM    addresses c
            WHERE   c.Address1 = b.Address1
            AND     c.Address2 = b.Address2
            AND     c.City = b.City
            AND     c.State = b.State
            AND     c.ZIP = b.ZIP
            AND     c.Duplicate != 'Y'
        ) AS AddressID
FROM    users a
JOIN    addresses b
ON      a.AddressID = b.AddressID;

This will update the users table to the results shown in the query above.

UPDATE  users a
JOIN    addresses b
ON      a.AddressID = b.AddressID
SET     a.AddressID =
        (
            SELECT  c.AddressID
            FROM    addresses c
            WHERE   c.Address1 = b.Address1
            AND     c.Address2 = b.Address2
            AND     c.City = b.City
            AND     c.State = b.State
            AND     c.ZIP = b.ZIP
            AND     c.Duplicate != 'Y'
        )
WHERE   b.Duplicate = 'Y';

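Note that with the sample data you provided, the AddressID for #12 Barbara comes back NULL in the SELECT query: her address is marked as a duplicate, but it is actually unique in the list provided, so there is no non-duplicate row for it to match. It does not match address 1 as indicated in the "how it should look" results. The UPDATE above would likewise set her AddressID to NULL.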
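
To find rows like Barbara's before running the UPDATE, something along these lines should list every flagged address that has no non-duplicate counterpart. It is only a sketch: the aliases `d` and `n` are mine, and it assumes all five address columns must match for two rows to count as the same address.

```sql
-- Flagged duplicates with no matching non-duplicate row.
-- <=> is MySQL's NULL-safe equality, so NULL columns still compare equal.
SELECT d.AddressID, d.Address1, d.Address2, d.City, d.State, d.ZIP
FROM addresses d
LEFT JOIN addresses n
       ON  n.Address1 <=> d.Address1
       AND n.Address2 <=> d.Address2
       AND n.City     <=> d.City
       AND n.State    <=> d.State
       AND n.ZIP      <=> d.ZIP
       AND n.Duplicate != 'Y'
WHERE d.Duplicate = 'Y'
  -- No non-duplicate partner was found by the LEFT JOIN
  AND n.AddressID IS NULL;
```

Against the sample data this should come back with address 12 only.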

Edit

In order to handle incorrect duplicate flags like #12 Barbara, or other duplicates that have not been marked as such, you can skip the duplicate flag check and use ORDER BY and LIMIT in the subquery so that it returns the lowest matching AddressID, regardless of the duplicate flag:

UPDATE  users a
JOIN    addresses b
ON      a.AddressID = b.AddressID
SET     a.AddressID =
        (
            SELECT      c.AddressID
            FROM        addresses c
            WHERE       c.Address1 = b.Address1
            AND         c.Address2 = b.Address2
            AND         c.City = b.City
            AND         c.State = b.State
            AND         c.ZIP = b.ZIP
            ORDER BY    c.AddressID ASC
            LIMIT       1
        );
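
Because this version ignores the flag, the cleanup afterwards can ignore it too: any address that no user references anymore is a leftover. A possible sketch of that cleanup, assuming every address you want to keep is referenced by at least one user (true for the sample data, where each address starts out with exactly one user):

```sql
-- Preview what would be removed: addresses no user points to anymore.
SELECT a.AddressID, a.Address1, a.Address2, a.City, a.State, a.ZIP
FROM addresses a
LEFT JOIN users u ON u.AddressID = a.AddressID
WHERE u.UserID IS NULL;

-- Multi-table DELETE: removes only rows from `addresses` (alias a).
-- The foreign key is never violated, since these rows are unreferenced.
DELETE a
FROM addresses a
LEFT JOIN users u ON u.AddressID = a.AddressID
WHERE u.UserID IS NULL;
```

If your real table contains legitimate addresses that simply have no user attached yet, this would delete those as well, so check the preview first.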

answered Oct 15 '22 08:10

WebChemist

Tags: sql, php, mysql, duplicate-removal, normalization
