I'm trying to normalize a table which a previous developer designed to have a column containing pipe-separated IDs which link to other rows in the same table.
Customers Table
id | aliases (VARCHAR)
----------------------------
1 | |4|58|76
2 |
3 |
4 | |1|58|76
... |
58 | |1|4|76
... |
76 | |1|4|58
So customer 1, 4, 58 and 76 are all "aliases" of each other. Customer 2 and 3 have no aliases, so the field contains an empty string.
I want to do away with the entire "alias" system, and normalise the data so I can map those other customers all to the one record. So I want related table data for customer 1, 4, 58, and 76 all to be mapped just to customer 1.
I figured I would populate a new table which later I can then join and perform updates on other tables.
Join Table
id | customer_id | alias_id
-------------------------------
1 | 1 | 4
2 | 1 | 58
3 | 1 | 76
How can I get the data from that first table, into the above format? If this is going to be an absolute nightmare in pure SQL, I will just write a PHP script which attempts to do this work and insert the data.
When I started to answer this question, I thought it would be quick and easy because I'd done something very similar once in SQL Server, but proving out the concept in translation burgeoned into this full solution.
One caveat that wasn't clear from your question is whether you have a condition for declaring the primary id vs the alias id. For instance, this solution will allow 1 to have an alias of 4 as well as 4 to have an alias of 1, which is consistent with the provided data in your simplified example question.
To setup the data for this example, I used this structure:
CREATE TABLE notnormal_customers (
id INT NOT NULL PRIMARY KEY,
aliases VARCHAR(10)
);
INSERT INTO notnormal_customers (id,aliases)
VALUES
(1,'|4|58|76'),
(2,''),
(3,''),
(4,'|1|58|76'),
(58,'|1|4|76'),
(76,'|1|4|58');
First, in order to represent the one-to-many relationship for one-customer to many-aliases, I created this table:
CREATE TABLE customer_aliases (
primary_id INT NOT NULL,
alias_id INT NOT NULL,
FOREIGN KEY (primary_id) REFERENCES notnormal_customers(id),
FOREIGN KEY (alias_id) REFERENCES notnormal_customers(id),
/* clustered primary key prevents duplicates */
PRIMARY KEY (primary_id,alias_id)
)
Most importantly, we'll use a custom SPLIT_STR
function:
CREATE FUNCTION SPLIT_STR(
x VARCHAR(255),
delim VARCHAR(12),
pos INT
)
RETURNS VARCHAR(255)
RETURN REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos),
LENGTH(SUBSTRING_INDEX(x, delim, pos -1)) + 1),
delim, '');
Then we'll create a stored procedure to do all the work. Code is annotated with comments to source references.
DELIMITER $$
CREATE PROCEDURE normalize_customers()
BEGIN
DECLARE cust_id INT DEFAULT 0;
DECLARE al_id INT UNSIGNED DEFAULT 0;
DECLARE alias_str VARCHAR(10) DEFAULT '';
/* set the value of the string delimiter */
DECLARE string_delim CHAR(1) DEFAULT '|';
DECLARE count_aliases INT DEFAULT 0;
DECLARE i INT DEFAULT 1;
/*
use cursor to iterate through all customer records
http://burnignorance.com/mysql-tips/how-to-loop-through-a-result-set-in-mysql-strored-procedure/
*/
DECLARE done INT DEFAULT 0;
DECLARE cur CURSOR FOR
SELECT `id`, `aliases`
FROM `notnormal_customers`;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
OPEN cur;
read_loop: LOOP
/*
Fetch one record from CURSOR and set to customer id and alias string.
If not found then `done` will be set to 1 by continue handler.
*/
FETCH cur INTO cust_id, alias_str;
IF done THEN
/* If done set to 1 then exit the loop, else continue. */
LEAVE read_loop;
END IF;
/* skip to next record if no aliases */
IF alias_str = '' THEN
ITERATE read_loop;
END IF;
/*
get number of aliases
https://pisceansheart.wordpress.com/2008/04/15/count-occurrence-of-character-in-a-string-using-mysql/
*/
SET count_aliases = LENGTH(alias_str) - LENGTH(REPLACE(alias_str, string_delim, ''));
/* strip off the first pipe to make it compatible with our SPLIT_STR function */
SET alias_str = SUBSTR(alias_str, 2);
/*
iterate and get each alias from custom split string function
https://stackoverflow.com/questions/18304857/split-delimited-string-value-into-rows
*/
WHILE i <= count_aliases DO
/* get the next alias id */
SET al_id = CAST(SPLIT_STR(alias_str, string_delim, i) AS UNSIGNED);
/* REPLACE existing values instead of insert to prevent errors on primary key */
REPLACE INTO customer_aliases (primary_id,alias_id) VALUES (cust_id,al_id);
SET i = i+1;
END WHILE;
SET i = 1;
END LOOP;
CLOSE cur;
END$$
DELIMITER ;
Finally you can simply run it by calling:
CALL normalize_customers();
Then you can check the data in console:
mysql> select * from customer_aliases;
+------------+----------+
| primary_id | alias_id |
+------------+----------+
| 4 | 1 |
| 58 | 1 |
| 76 | 1 |
| 1 | 4 |
| 58 | 4 |
| 76 | 4 |
| 1 | 58 |
| 4 | 58 |
| 76 | 58 |
| 1 | 76 |
| 4 | 76 |
| 58 | 76 |
+------------+----------+
12 rows in set (0.00 sec)
Update 2 (One-Query-Solution)
Assuming that the aliases list is always sorted, you can achieve the result with only one query:
CREATE TABLE aliases (
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
customer_id INT UNSIGNED NOT NULL,
alias_id INT UNSIGNED NOT NULL
) AS
SELECT NULL AS id, c1.id AS customer_id, c2.id AS alias_id
FROM customers c1
JOIN customers c2
ON c2.aliases LIKE CONCAT('|', c1.id , '|%') -- c1.id is the first alias of c2.id
WHERE c1.id < (SUBSTRING(c1.aliases,2)+0) -- c1.id is smaller than the first alias of c2.id
It will also be much faster, if the aliases
column is indexed, so the JOIN will be supported by a range search.
sqlfiddle
Original answer
If you replace the pipes with commas, you can use the FIND_IN_SET function.
I would first create a temporary table (does not need to be technicaly temporary) to store comma separated alias lists:
CREATE TABLE tmp (`id` int, `aliases` varchar(50));
INSERT INTO tmp(`id`, `aliases`)
SELECT id, REPLACE(aliases, '|', ',') AS aliases
FROM customers;
Then populate your normalized table using FIND_IN_SET in the JOINs ON clause:
CREATE TABLE aliases (`id` int, `customer_id` int, `alias_id` int) AS
SELECT t.id as customer_id, c.id AS alias_id
FROM tmp t
JOIN customers c ON find_in_set(c.id, t.aliases);
If needed - delete duplicates with higher customer_id (only keep lowest):
DELETE FROM aliases
WHERE customer_id IN (SELECT * FROM(
SELECT DISTINCT a1.customer_id
FROM aliases a1
JOIN aliases a2
ON a2.customer_id = a1.alias_id
AND a1.customer_id = a2.alias_id
AND a1.customer_id > a1.alias_id
)derived);
If needed - create AUTO_INCREMENT id:
ALTER TABLE aliases ADD column id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
The aliases
table will now look like that:
| id | customer_id | alias_id |
|----|-------------|----------|
| 1 | 1 | 4 |
| 2 | 1 | 58 |
| 3 | 1 | 76 |
sqlfiddle
Don't forget to define proper indexes.
Update 1
You can skip creating a temporary table and populate the aliases
table using LIKE instead of FIND_IN_SET:
CREATE TABLE aliases (`customer_id` int, `alias_id` int) AS
SELECT c2.id as customer_id, c1.id AS alias_id
FROM customers c1
JOIN customers c2
ON CONCAT(c1.aliases, '|') LIKE CONCAT('%|', c2.id , '|%');
sqlfiddle
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With