I've got a customer database filled with normalized addresses. There are duplicates.
Each user created their own record, and entered their own address. So we have a 1-to-1 relationship between the users and the addresses:
CREATE TABLE `users` (
`UserID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`Name` VARCHAR(63),
`Email` VARCHAR(63),
`AddressID` INT UNSIGNED,
PRIMARY KEY (`UserID`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `addresses` (
`AddressID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`Duplicate` VARCHAR(1),
`Address1` VARCHAR(63) DEFAULT NULL,
`Address2` VARCHAR(63) DEFAULT NULL,
`City` VARCHAR(63) DEFAULT NULL,
`State` VARCHAR(2) DEFAULT NULL,
`ZIP` VARCHAR(10) DEFAULT NULL,
PRIMARY KEY (`AddressID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
And the data:
INSERT INTO `users` VALUES
(1, 'Michael', '[email protected]', 1),
(2, 'Steve', '[email protected]', 2),
(3, 'Judy', '[email protected]', 3),
(4, 'Kathy', '[email protected]', 4),
(5, 'Mark', '[email protected]', 5),
(6, 'Robert', '[email protected]', 6),
(7, 'Susan', '[email protected]', 7),
(8, 'Paul', '[email protected]', 8),
(9, 'Patrick', '[email protected]', 9),
(10, 'Mary', '[email protected]', 10),
(11, 'James', '[email protected]', 11),
(12, 'Barbara', '[email protected]', 12),
(13, 'Peter', '[email protected]', 13);
INSERT INTO `addresses` VALUES
(1, '', '1234 Main Street', '', 'Springfield', 'KS', '54321'),
(2, 'Y', '1234 Main Street', '', 'Springfield', 'KS', '54321'),
(3, 'Y', '1234 Main Street', '', 'Springfield', 'KS', '54321'),
(4, '', '5678 Sycamore Lane', '', 'Upstate', 'NY', '50000'),
(5, '', '1000 State Street', 'Apt C', 'Sunnydale', 'OH', '54321'),
(6, 'Y', '1234 Main Street', '', 'Springfield', 'KS', '54321'),
(7, 'Y', '1000 State Street', 'Apt C', 'Sunnydale', 'OH', '54321'),
(8, 'Y', '1234 Main Street', '', 'Springfield', 'KS', '54321'),
(9, '', '1000 State Street', 'Apt A', 'Sunnydale', 'OH', '54321'),
(10, 'Y', '1234 Main Street', '', 'Springfield', 'KS', '54321'),
(11, 'Y', '5678 Sycamore Lane', '', 'Upstate', 'NY', '50000'),
(12, 'Y', '1000 Main Street', 'Apt A', 'Sunnydale', 'OH', '54321'),
(13, '', '9999 Valleyview', '', 'Springfield', 'KS', '54321');
Oh yes, let me add in that foreign key relationship:
ALTER TABLE `users` ADD CONSTRAINT `AddressID`
FOREIGN KEY `AddressID` (`AddressID`)
REFERENCES `addresses` (`AddressID`);
We had our address list scrubbed by a 3rd-party service that normalized the data and indicated where we had duplicates. This is where the Duplicate
column came from. If there is a 'Y', it is a duplicate of another address. The primary address is NOT marked as a duplicate, as shown in the sample data.
I obviously want to remove all of the duplicate records, but there are user records that point to them. I need them to point to the version of the address that is NOT a duplicate.
So how can I update the AddressID
in users
to match the non-duplicate addresses?
The only way I can think to do it is by iterating through all of the data using a high-level language, but I'm fairly sure that MySQL has all the tools required to do something like this in a better way.
Here's what I've tried:
SELECT COUNT(*) as cnt, GROUP_CONCAT(AddressID ORDER BY AddressID) AS ids
FROM addresses
GROUP BY Address1, Address2, City, State, ZIP
HAVING cnt > 1;
+-----+--------------+
| cnt | ids |
+-----+--------------+
| 2 | 5,7 |
| 6 | 1,2,3,6,8,10 |
| 2 | 4,11 |
+-----+--------------+
3 rows in set (0.00 sec)
From there, I could loop through each result row and do this:
UPDATE `users` SET `AddressID` = 1 WHERE `AddressID` IN (2,3,6,8,10);
But there has got to be a better MySQL-only way, shouldn't there?
When everything is said and done, the data SHOULD look like this:
SELECT * FROM `users`;
+--------+---------+-------------------+-----------+
| UserID | Name | Email | AddressID |
+--------+---------+-------------------+-----------+
| 1 | Michael | [email protected] | 1 |
| 2 | Steve | [email protected] | 1 |
| 3 | Judy | [email protected] | 1 |
| 4 | Kathy | [email protected] | 4 |
| 5 | Mark | [email protected] | 5 |
| 6 | Robert | [email protected] | 1 |
| 7 | Susan | [email protected] | 5 |
| 8 | Paul | [email protected] | 1 |
| 9 | Patrick | [email protected] | 9 |
| 10 | Mary | [email protected] | 1 |
| 11 | James | [email protected] | 4 |
| 12 | Barbara | [email protected] | 1 |
| 13 | Peter | [email protected] | 13 |
+--------+---------+-------------------+-----------+
13 rows in set (0.00 sec)
SELECT * FROM `addresses`;
+-----------+-----------+--------------------+----------+-------------+-------+-------+
| AddressID | Duplicate | Address1 | Address2 | City | State | ZIP |
+-----------+-----------+--------------------+----------+-------------+-------+-------+
| 1 | | 1234 Main Street | | Springfield | KS | 54321 |
| 4 | | 5678 Sycamore Lane | | Upstate | NY | 50000 |
| 5 | | 1000 State Street | Apt C | Sunnydale | OH | 54321 |
| 9 | | 1000 State Street | Apt A | Sunnydale | OH | 54321 |
| 13 | | 9999 Valleyview | | Springfield | KS | 54321 |
+-----------+-----------+--------------------+----------+-------------+-------+-------+
5 rows in set (0.00 sec)
Help?
DELETE FROM [table_name] WHERE row_number > 1; In our example dates table, the command would be: DELETE FROM dates WHERE row_number > 1; The output will tell you how many rows have been affected, that is, how many duplicate rows have been deleted.
Short answer: Yes, it can be NULL or duplicate. I want to explain why a foreign key might need to be null or might need to be unique or not unique. First remember a Foreign key simply requires that the value in that field must exist first in a different table (the parent table).
1) First identify the rows those satisfy the definition of duplicate and insert them into temp table, say #tableAll . 2) Select non-duplicate(single-rows) or distinct rows into temp table say #tableUnique. 3) Delete from source table joining #tableAll to delete the duplicates.
You have a many-to-one relationship between users and addresses (that is multiple users can map to the same address). This seems a bit odd to me, but I suppose it could be useful. Many-to-many would make more sense, i.e. a user can have multiple addresses, but the same address can be shared by multiple users. Generally, a single user has multiple addresses. Updating your schema may help, but I digress.
UPDATE users
-- We only care about users mapped to duplicate addresses
JOIN addresses dupe ON (users.AddressID = dupe.AddressID AND dupe.Duplicate='Y')
-- If your normalizer thingy worked right, these will all map to non-duplicates
JOIN addresses nondupe ON (dupe.Address1 = nondupe.Address1
-- Compare to other columns if you want
AND nondupe.Duplicate = '')
-- Set to the nondupe ID
SET users.AddressID = nondupe.AddressID;
http://sqlfiddle.com/#!2/5d303/1
To select the results you want to see:
SELECT a.UserID
,a.Name
,a.Email
,(
SELECT addressID
FROM addresses c
WHERE c.Address1 = b.Address1
AND c.Address2 = b.Address2
AND c.City = b.City
AND c.State = b.State
AND c.ZIP = b.ZIP
AND DUPLICATE != 'Y'
) as AddressID
FROM users a
JOIN addresses b
ON a.AddressID = b.AddressID
This will update the users table to the results shown in the query above.
UPDATE users a
JOIN addresses b
ON a.AddressID = b.AddressID
SET a.addressID =
(
SELECT addressID
FROM addresses c
WHERE c.Address1 = b.Address1
AND c.Address2 = b.Address2
AND c.City = b.City
AND c.State = b.State
AND c.ZIP = b.ZIP
AND Duplicate != 'Y'
)
WHERE Duplicate = 'Y'
Note that with the sample data you provided, #12 Barbara's ID is null in the SELECT
query since her address is marked as duplicate when in fact it is unique to the list provided. It does not match address 1 as indicated in the "how it should look" results.
Edit
In order to handle incorrect duplicate flags like #12 Barbara, or maybe other missed duplicates that have not marked as such, you can skip the duplicate flag column check and just use ORDER BY
& LIMIT
on the sub-query so that it will return the first lowest matching address ID, regardless of the duplicate flag:
UPDATE users a
JOIN addresses b
ON a.AddressID = b.AddressID
SET a.addressID =
(
SELECT addressID
FROM addresses c
WHERE c.Address1 = b.Address1
AND c.Address2 = b.Address2
AND c.City = b.City
AND c.State = b.State
AND c.ZIP = b.ZIP
ORDER BY c.addressID ASC
LIMIT 1
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With