I am wondering how to convert comma-delimited values into rows in Redshift. I am afraid that my own solution isn't optimal. Please advise. I have a table in which one of the columns holds comma-separated values.
I have:
user_id | user_name | user_action
--------+-----------+----------------------
1       | Shone     | start,stop,cancell...
I would like to see
user_id | user_name | parsed_action
--------+-----------+--------------
1       | Shone     | start
1       | Shone     | stop
1       | Shone     | cancell
....
SPLIT_PART splits a string on the specified delimiter and returns the part at the specified position.
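To make the SPLIT_PART behavior concrete, here is a minimal Python sketch of it. It is only a model, not Redshift code: the function name mirrors the SQL function, positions are 1-based, and a position past the last part yields an empty string, which is what the queries further down rely on. Rejecting positions below 1 is a choice made for this sketch.

```python
def split_part(text: str, delimiter: str, position: int) -> str:
    """Rough Python model of SPLIT_PART: 1-based position, '' when past the end."""
    if position < 1:
        # Modeling choice for this sketch: only positions >= 1 are supported.
        raise ValueError("position must be >= 1")
    parts = text.split(delimiter)
    if position <= len(parts):
        return parts[position - 1]
    return ""


actions = "start,stop,cancell"        # sample value from the question
first = split_part(actions, ",", 1)   # first element of the list
third = split_part(actions, ",", 3)   # last element of the list
beyond = split_part(actions, ",", 4)  # past the end: empty string
print(first, third, repr(beyond))
```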
To extract the beginning segment of a string based on the length in bytes, you can CAST the string as VARCHAR(byte_length) to truncate the string, where byte_length is the required length.
A slight improvement over the existing answer is to use a second "numbers" table that enumerates all of the possible list lengths and then use a cross join to make the query more compact.
Redshift does not have a straightforward method for creating a numbers table that I am aware of, but we can use a bit of a hack from https://www.periscope.io/blog/generate-series-in-redshift-and-mysql.html to create one using row numbers.
Specifically, if we assume the number of rows in cmd_logs is larger than the maximum number of commas in the user_action column, we can create a numbers table by counting rows. To start, let's assume there are at most 99 commas in the user_action column:
select (row_number() over (order by true))::int as n into numbers from cmd_logs limit 100;
If we want to get fancy, we can compute the number of commas from the cmd_logs table to create a more precise set of rows in numbers:
select n::int into numbers from (select row_number() over (order by true) as n from cmd_logs) as r cross join (select max(regexp_count(user_action, '[,]')) as max_num from cmd_logs) as m where n <= max_num + 1;
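The sizing logic of that query can be sketched in Python as follows. This is an illustration only; the user_action values are hypothetical. A list with k commas has k + 1 elements, which is why the query keeps n up to max_num + 1.

```python
# Hypothetical user_action values standing in for the cmd_logs column.
user_actions = ["start,stop,cancell", "start", "stop,start"]

# Counterpart of max(regexp_count(user_action, '[,]')): most commas in any row.
max_commas = max(value.count(",") for value in user_actions)

# A value with k commas has k + 1 elements, so numbers runs from 1 to max_commas + 1.
numbers = list(range(1, max_commas + 2))
print(numbers)
```

Note that the SQL version can only produce as many numbers as cmd_logs has rows, which is the assumption stated earlier in this answer.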
Once there is a numbers table, we can do:
select user_id, user_name, split_part(user_action,',',n) as parsed_action from cmd_logs cross join numbers where split_part(user_action,',',n) is not null and split_part(user_action,',',n) != '';
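As a rough illustration of what this cross join computes, the Python sketch below pairs every row with every number and keeps only the positions that yield a non-empty part. The second sample row and the contents of numbers are hypothetical, and split_part here is a simplified stand-in for the SQL function.

```python
from itertools import product


def split_part(text: str, delimiter: str, position: int) -> str:
    """Simplified model of SPLIT_PART: 1-based, '' when position is past the end."""
    parts = text.split(delimiter)
    return parts[position - 1] if 1 <= position <= len(parts) else ""


# Hypothetical contents of cmd_logs: (user_id, user_name, user_action).
cmd_logs = [
    (1, "Shone", "start,stop,cancell"),
    (2, "Gregory", "start"),
]
numbers = [1, 2, 3]  # stand-in for the numbers table

# Cross join, then drop positions beyond the end of each list.
rows = [
    (user_id, user_name, split_part(user_action, ",", n))
    for (user_id, user_name, user_action), n in product(cmd_logs, numbers)
    if split_part(user_action, ",", n) != ""
]

for row in rows:
    print(row)
```

One consequence of the empty-string filter, in the SQL as well as in this sketch, is that genuinely empty elements such as the middle one in "start,,stop" are dropped too.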
Another idea is to transform your CSV string into JSON first, followed by JSON extract, along the following lines:
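... '["' || replace( user_action, ',', '", "' ) || '"]' AS replaced
... JSON_EXTRACT_ARRAY_ELEMENT_TEXT(replaced, numbers.i) AS parsed_action
Where "numbers" is the table from the first answer. Note that JSON_EXTRACT_ARRAY_ELEMENT_TEXT uses a zero-based index, so i has to start at 0; with a numbers table whose column n starts at 1, the index would be n - 1. The advantage of this approach is the ability to use built-in JSON functionality.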
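The string manipulation behind this idea can be sketched in Python, assuming the character being replaced is the comma delimiter. The helper name csv_to_json_array is made up for this example; json.loads stands in for the JSON extraction, and list indexing is zero-based, as is the index of JSON_EXTRACT_ARRAY_ELEMENT_TEXT.

```python
import json


def csv_to_json_array(value: str) -> str:
    """Mirror of '["' || replace(value, ',', '", "') || '"]' (hypothetical helper)."""
    return '["' + value.replace(",", '", "') + '"]'


replaced = csv_to_json_array("start,stop,cancell")  # sample value from the question
elements = json.loads(replaced)                      # parse the generated JSON array

# Zero-based access: index 0 is the first element of the list.
first_action = elements[0]
print(replaced, first_action)
```

This only produces valid JSON while the values contain no double quotes or backslashes; such characters would need escaping first.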