Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split values over multiple rows in RedShift

The question of how to split a field (e.g. a CSV string) into multiple rows has already been answered: Split values over multiple rows.

However, this question refers to MSSQL, and the answers use various features for which there are no RedShift equivalents.

For the sake of completeness, here's an example of what I'd like to do:

Current data:

| Key | Data     |
+-----+----------+
| 1   | 18,20,22 |
| 2   | 17,19    |

Required data:

| Key | Data     |
+-----+----------+
| 1   | 18       |
| 1   | 20       |
| 1   | 22       |
| 2   | 17       |
| 2   | 19       |

Now, I can suggest a walkaround for the case of small, bounded number of elements in the CSV field: use split_part and union over all possible array locations, like so:

SELECT Key, split_part(Data, ',', 1) 
FROM mytable
WHERE split_part(Data, ',', 1) != ""
    UNION
SELECT Key, split_part(Data, ',', 2) 
FROM mytable
WHERE split_part(Data, ',', 2) != ""
-- etc. etc.

However, this is obviously very inefficient, and would not work for longer lists.

Any better ideas on how to do this?

EDIT:

There's also a somewhat similar question regarding multiplying rows: splitting rows in Redshift. However I don't see how this approach can be applied here.

EDIT 2:

A possible duplicate: Redshift. Convert comma delimited values into rows. But nothing new - the answer by @Masashi Miyazaki is similar to my suggestion above, and suffers from the same issues.

like image 560
etov Avatar asked Apr 28 '15 13:04

etov


People also ask

What is split part redshift?

Splits a string on the specified delimiter and returns the part at the specified position.

Does Unnest work in redshift?

To unnest queries, Amazon Redshift uses the PartiQL syntax to iterate over SUPER arrays. It does this by navigating the array using the FROM clause of a query. Using the previous example, the following example iterates over the attribute values for c_orders .

Can you pivot in redshift?

Amazon Redshift now natively supports PIVOT and UNPIVOT SQL operators with built-in optimizations that you can use for data modeling, data analysis, and data presentation. You can apply PIVOT and UNPIVOT to tables, sub-queries, and common table expressions (CTEs).


1 Answers

Here is the Redshift answer, it will work with up to 10 thousand values per row.

Set up test data

create table test_data (key varchar(50),data varchar(max));
insert into test_data
    values
      (1,'18,20,22'),
      (2,'17,19')
;

code

with ten_numbers as (select 1 as num union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 0)
  , generted_numbers AS
(
    SELECT (1000 * t1.num) + (100 * t2.num) + (10 * t3.num) + t4.num AS gen_num
    FROM ten_numbers AS t1
      JOIN ten_numbers AS t2 ON 1 = 1
      JOIN ten_numbers AS t3 ON 1 = 1
      JOIN ten_numbers AS t4 ON 1 = 1
)
  , splitter AS
(
    SELECT *
    FROM generted_numbers
    WHERE gen_num BETWEEN 1 AND (SELECT max(REGEXP_COUNT(data, '\\,') + 1)
                                 FROM test_data)
)
  , expanded_input AS
(
    SELECT
      key,
      split_part(data, ',', s.gen_num) AS data
    FROM test_data AS td
      JOIN splitter AS s ON 1 = 1
    WHERE split_part(data, ',', s.gen_num) <> ''
)
SELECT * FROM expanded_input
order by key,data;
like image 159
Jon Scott Avatar answered Oct 06 '22 00:10

Jon Scott