Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Managing overflows in LISTAGG on Amazon Redshift

Using the example from this post: https://blogs.oracle.com/datawarehousing/entry/managing_overflows_in_listagg

The following statement:

SELECT
deptno,
LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist 
FROM emp 
GROUP BY deptno;

will generate the following output:

DEPTNO     NAMELIST
---------- ----------------------------------------
10         CLARK;KING;MILLER
20         SMITH;JONES;SCOTT;ADAMS;FORD
30         ALLEN;WARD;MARTIN;BLAKE;TURNER;JAMES

Let’s assume that the above statement does not run and that we have a limit of 15 characters that can be returned by each row in our LISTAGG function. This is in actuality 65535 on Amazon Redshift.

We would want the following to be returned in this case:

DEPTNO     NAMELIST
---------- ----------------------------------------
10         CLARK;KING
10         MILLER
20         SMITH;JONES
20         SCOTT;ADAMS
20         FORD
30         ALLEN;WARD
30         MARTIN;BLAKE
30         TURNER;JAMES 

What would be the best way to recreate this result in Amazon Redshift to avoid any data loss and taking speed into consideration?

like image 277
Pouya Yousefi Avatar asked Mar 11 '16 21:03

Pouya Yousefi


1 Answers

It's possible to achieve this with 2 subquery:

First:

SELECT id, field,
       sum(length(field) + 1) over 
       (partition by id order by RANDOM() rows unbounded preceding) as total_length_now
    from my_schema.my_table)

Initially we want to calculate how many chars we have for each id in our table. We can use a window function to calculate it incrementally for each row. In the 'order by' statement you can use any unique field that you have. If you don't have one, you can simply use random or an hash function, but is mandatory that the field is unique, if not, the function will not work as we want.

The '+1' in the length represent the semicolon that we will use in the listagg function.

Second:

SELECT id, field, total_length_now / 65535 as sub_id
FROM (sub_query_1)

Now we create a sub_id based on the length that we calculated before. If the total_length_now exceed the limit size (in this case 65535) the division's rest will return a new sub_id.

Last Step

SELECT id, sub_id, listagg(field, ';') as namelist
FROM (sub_query_2)
GROUP BY id, sub_id
ORDER BY id, sub_id

Now we can simply call the listagg function grouping by id and sub_id, since each group cannot exceed the size limit.

Complete query

SELECT id, sub_id, listagg(field, ';') as namelist
FROM (
SELECT id, field, total_length_now / 65535 as sub_id
FROM (SELECT id,
       field,
       sum(length(field) + 1) over 
       (partition by id order by field rows unbounded preceding) as total_length_now
from support.test))
GROUP BY id, sub_id
order by id, sub_id

Example with your data (with size limit = 10)

First and second query output:

id, field, total_length_now, sub_id

10,KING,5,0
10,CLARK,11,1
10,MILLER,18,1
20,ADAMS,6,0
20,SMITH,12,1
20,JONES,18,1
20,FORD,23,2
20,SCOTT,29,2
30,JAMES,6,0
30,BLAKE,12,1
30,WARD,17,1
30,MARTIN,24,2
30,TURNER,31,3
30,ALLEN,37,3

Final query output:

id,sub_id,namelist

10,0,KING
10,1,CLARK;MILLER
20,0,ADAMS
20,1,SMITH;JONES
20,2,FORD;SCOTT
30,0,JAMES
30,1,BLAKE;WARD
30,2,MARTIN
30,3,TURNER;ALLEN
like image 151
Hyruma92 Avatar answered Jan 02 '23 10:01

Hyruma92