Condensing arrays in Presto

Tags: sql, presto

I have a query that produces arrays of strings using the array_agg() function:

SELECT 
array_agg(message) as sequence
from mytable
group by id

which produces a table that looks like this:

                 sequence
1 foo foo bar baz bar baz
2     foo bar bar bar baz
3 foo foo foo bar bar baz

I want to condense each array so that no string appears more than once in a row. For example, the desired output would look like this:

    sequence
1 foo bar baz bar baz
2 foo bar baz
3 foo bar baz

How would one go about doing this with Presto SQL?

asked May 28 '19 20:05 by the_darkside


1 Answer

You can do this in one of two ways:

  1. Remove duplicates from the resulting arrays using the array_distinct function:
WITH mytable(id, message) AS (VALUES
  (1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'bar'), (1, 'baz'), (1, 'baz'),
  (2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
  (3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_distinct(array_agg(message)) AS sequence
FROM mytable
GROUP BY id
  2. Use the DISTINCT qualifier in the aggregation to remove the duplicate values before they are passed into array_agg:
WITH mytable(id, message) AS (VALUES
  (1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'bar'), (1, 'baz'), (1, 'baz'),
  (2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
  (3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_agg(DISTINCT message) AS sequence
FROM mytable
GROUP BY id

Both alternatives produce the same result:

    sequence
-----------------
 [foo, bar, baz]
 [foo, bar, baz]
 [foo, bar, baz]
(3 rows)
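
Note that both alternatives drop every duplicate, not only consecutive ones. The sample data above orders group 1 as foo, foo, bar, bar, baz, baz, so the difference does not show there, but it does with the first sequence from the question. A minimal sketch, using a literal array instead of array_agg so that the element order is fixed:

```sql
-- First sequence from the question, written as a literal array
-- (an illustration only; not taken from the original answer).
SELECT array_distinct(ARRAY['foo', 'foo', 'bar', 'baz', 'bar', 'baz']) AS sequence
```

This is expected to keep only the three distinct values, losing the trailing bar, baz that the desired output retains.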

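UPDATE: You can collapse runs of consecutive repeated elements with the recently introduced MATCH_RECOGNIZE feature. In the query below, the pattern A B* matches one row followed by any number of rows whose message equals that of the previous row, and only the message of the first row (A) is emitted for each match: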

WITH mytable(id, message) AS (VALUES
  (1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'baz'), (1, 'bar'), (1, 'baz'),
  (2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
  (3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_agg(value) AS sequence
FROM mytable
MATCH_RECOGNIZE(
    PARTITION BY id
    MEASURES A.message AS value
    PATTERN (A B*)
    DEFINE B AS message = PREV(message)
)
GROUP BY id
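
On versions without MATCH_RECOGNIZE, a similar effect may be achievable with the array lambda functions. The following is a sketch, not part of the original answer. It assumes each array is already in the intended order and contains no NULL elements; the arrays CTE and its arr column are hypothetical stand-ins for the output of array_agg.

```sql
-- Hypothetical input: one already-ordered array per id,
-- standing in for the result of array_agg(message).
WITH arrays(id, arr) AS (VALUES
  (1, ARRAY['foo', 'foo', 'bar', 'baz', 'bar', 'baz']),
  (2, ARRAY['foo', 'bar', 'bar', 'bar', 'baz']),
  (3, ARRAY['foo', 'foo', 'foo', 'bar', 'bar', 'baz'])
)
SELECT
  id,
  transform(
    -- Keep index i if it is the first position or if the element
    -- differs from its predecessor. IF guards against reading arr[0].
    filter(
      sequence(1, cardinality(arr)),
      i -> IF(i = 1, true, arr[i] <> arr[i - 1])
    ),
    -- Map the surviving indexes back to their values.
    i -> arr[i]
  ) AS sequence
FROM arrays
ORDER BY id
```

Working on indexes rather than values lets each element be compared with its neighbour, which array_distinct cannot do. With the arrays shown, this is intended to yield the sequences requested in the question.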
answered Sep 27 '22 22:09 by Martin Traverso