Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL - How to normalize column containing delimiter-separated IDs

Tags:

sql

php

mysql

I'm trying to normalize a table which a previous developer designed to have a column containing pipe-separated IDs which link to other rows in the same table.

Customers Table

id    |    aliases (VARCHAR)
----------------------------
1     |    |4|58|76
2     |    
3     |
4     |    |1|58|76
...   |    
58    |    |1|4|76
...   |
76    |    |1|4|58

So customer 1, 4, 58 and 76 are all "aliases" of each other. Customer 2 and 3 have no aliases, so the field contains an empty string.

I want to do away with the entire "alias" system, and normalise the data so I can map those other customers all to the one record. So I want related table data for customer 1, 4, 58, and 76 all to be mapped just to customer 1.

I figured I would populate a new table which later I can then join and perform updates on other tables.

Join Table

id  |  customer_id  |  alias_id
-------------------------------
1   |  1            |  4
2   |  1            |  58
3   |  1            |  76

How can I get the data from that first table, into the above format? If this is going to be an absolute nightmare in pure SQL, I will just write a PHP script which attempts to do this work and insert the data.

like image 446
BadHorsie Avatar asked Feb 07 '23 23:02

BadHorsie


2 Answers

When I started to answer this question, I thought it would be quick and easy because I'd done something very similar once in SQL Server, but proving out the concept in translation burgeoned into this full solution.

One caveat that wasn't clear from your question is whether you have a condition for declaring the primary id vs the alias id. For instance, this solution will allow 1 to have an alias of 4 as well as 4 to have an alias of 1, which is consistent with the provided data in your simplified example question.

To setup the data for this example, I used this structure:

CREATE TABLE notnormal_customers (
  id INT NOT NULL PRIMARY KEY,
  aliases VARCHAR(10)
);

INSERT INTO notnormal_customers (id,aliases)
VALUES
(1,'|4|58|76'),
(2,''),
(3,''),
(4,'|1|58|76'),
(58,'|1|4|76'),
(76,'|1|4|58');

First, in order to represent the one-to-many relationship for one-customer to many-aliases, I created this table:

CREATE TABLE customer_aliases (
    primary_id INT NOT NULL,
    alias_id INT NOT NULL,
    FOREIGN KEY (primary_id) REFERENCES notnormal_customers(id),
    FOREIGN KEY (alias_id)   REFERENCES notnormal_customers(id),
    /* clustered primary key prevents duplicates */
    PRIMARY KEY (primary_id,alias_id)
)

Most importantly, we'll use a custom SPLIT_STR function:

CREATE FUNCTION SPLIT_STR(
  x VARCHAR(255),
  delim VARCHAR(12),
  pos INT
)
RETURNS VARCHAR(255)
RETURN REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos),
       LENGTH(SUBSTRING_INDEX(x, delim, pos -1)) + 1),
       delim, '');

Then we'll create a stored procedure to do all the work. Code is annotated with comments to source references.

DELIMITER $$
CREATE PROCEDURE normalize_customers()
BEGIN

  DECLARE cust_id INT DEFAULT 0;
  DECLARE al_id INT UNSIGNED DEFAULT 0;
  DECLARE alias_str VARCHAR(10) DEFAULT '';
  /* set the value of the string delimiter */
  DECLARE string_delim CHAR(1) DEFAULT '|';
  DECLARE count_aliases INT DEFAULT 0;
  DECLARE i INT DEFAULT 1;

  /*
    use cursor to iterate through all customer records
    http://burnignorance.com/mysql-tips/how-to-loop-through-a-result-set-in-mysql-strored-procedure/
  */
  DECLARE done INT DEFAULT 0;
  DECLARE cur CURSOR FOR
      SELECT `id`, `aliases`
      FROM `notnormal_customers`;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  OPEN cur;
  read_loop: LOOP

    /*
      Fetch one record from CURSOR and set to customer id and alias string.
      If not found then `done` will be set to 1 by continue handler.
    */
    FETCH cur INTO cust_id, alias_str;
    IF done THEN
        /* If done set to 1 then exit the loop, else continue. */
        LEAVE read_loop;
    END IF;

    /* skip to next record if no aliases */
    IF alias_str = '' THEN
      ITERATE read_loop;
    END IF;

    /*
      get number of aliases
      https://pisceansheart.wordpress.com/2008/04/15/count-occurrence-of-character-in-a-string-using-mysql/
    */
    SET count_aliases = LENGTH(alias_str) - LENGTH(REPLACE(alias_str, string_delim, ''));

    /* strip off the first pipe to make it compatible with our SPLIT_STR function */
    SET alias_str = SUBSTR(alias_str, 2);

    /*
      iterate and get each alias from custom split string function
      https://stackoverflow.com/questions/18304857/split-delimited-string-value-into-rows
    */
    WHILE i <= count_aliases DO

      /* get the next alias id */
      SET al_id = CAST(SPLIT_STR(alias_str, string_delim, i) AS UNSIGNED);
      /* REPLACE existing values instead of insert to prevent errors on primary key */
      REPLACE INTO customer_aliases (primary_id,alias_id) VALUES (cust_id,al_id);
      SET i = i+1;

    END WHILE;
    SET i = 1;

  END LOOP;
  CLOSE cur;

END$$
DELIMITER ;

Finally you can simply run it by calling:

CALL normalize_customers();

Then you can check the data in console:

mysql> select * from customer_aliases;
+------------+----------+
| primary_id | alias_id |
+------------+----------+
|          4 |        1 |
|         58 |        1 |
|         76 |        1 |
|          1 |        4 |
|         58 |        4 |
|         76 |        4 |
|          1 |       58 |
|          4 |       58 |
|         76 |       58 |
|          1 |       76 |
|          4 |       76 |
|         58 |       76 |
+------------+----------+
12 rows in set (0.00 sec)
like image 177
Jeff Puckett Avatar answered Feb 09 '23 12:02

Jeff Puckett


Update 2 (One-Query-Solution)

Assuming that the aliases list is always sorted, you can achieve the result with only one query:

CREATE TABLE aliases (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer_id INT UNSIGNED NOT NULL,
  alias_id INT UNSIGNED NOT NULL
) AS
  SELECT NULL AS id, c1.id AS customer_id, c2.id AS alias_id
  FROM customers c1
  JOIN customers c2 
    ON c2.aliases LIKE CONCAT('|', c1.id , '|%') -- c1.id is the first alias of c2.id
  WHERE c1.id < (SUBSTRING(c1.aliases,2)+0) -- c1.id is smaller than the first alias of c2.id

It will also be much faster, if the aliases column is indexed, so the JOIN will be supported by a range search.

sqlfiddle

Original answer

If you replace the pipes with commas, you can use the FIND_IN_SET function.

I would first create a temporary table (does not need to be technicaly temporary) to store comma separated alias lists:

CREATE TABLE tmp (`id` int, `aliases` varchar(50));
INSERT INTO tmp(`id`, `aliases`)
  SELECT id, REPLACE(aliases, '|', ',')  AS aliases
  FROM customers;

Then populate your normalized table using FIND_IN_SET in the JOINs ON clause:

CREATE TABLE aliases (`id` int, `customer_id` int, `alias_id` int) AS
  SELECT t.id as customer_id, c.id AS alias_id
  FROM tmp t
  JOIN customers c ON find_in_set(c.id, t.aliases);

If needed - delete duplicates with higher customer_id (only keep lowest):

DELETE FROM aliases 
WHERE customer_id IN (SELECT * FROM(
  SELECT DISTINCT a1.customer_id
  FROM aliases a1
  JOIN aliases a2
    ON  a2.customer_id = a1.alias_id
    AND a1.customer_id = a2.alias_id
    AND a1.customer_id > a1.alias_id
)derived);

If needed - create AUTO_INCREMENT id:

ALTER TABLE aliases ADD column id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;

The aliases table will now look like that:

| id | customer_id | alias_id |
|----|-------------|----------|
|  1 |           1 |        4 |
|  2 |           1 |       58 |
|  3 |           1 |       76 |

sqlfiddle

Don't forget to define proper indexes.

Update 1

You can skip creating a temporary table and populate the aliases table using LIKE instead of FIND_IN_SET:

CREATE TABLE aliases (`customer_id` int, `alias_id` int) AS
  SELECT c2.id as customer_id, c1.id AS alias_id
  FROM customers c1
  JOIN customers c2 
    ON CONCAT(c1.aliases, '|') LIKE CONCAT('%|', c2.id , '|%');

sqlfiddle

like image 34
Paul Spiegel Avatar answered Feb 09 '23 14:02

Paul Spiegel