I am trying to run a query that lists all the distinct markets a user has submitted to for my dataset.
The values in the Markets column are already in an array format. When I run the query below, I get an array of arrays and some markets may be listed multiple times because the distinct clause is looking at the unique arrays and not the values in the arrays.
For example if I'm trying to group ['New York'] and ['New York' , 'Chicago'], my goal is to get ['New York', 'Chicago'] as my result, but am currently getting [['New York'],['New York', 'Chicago']]. Appreciate any assistance.
SELECT 
  s.submitter_id,
  ARRAY_AGG(DISTINCT s.markets)
FROM 
  analytics.submissions AS s
GROUP BY 1
A simple way is to first flatten the array
WITH data AS (
 SELECT submitter_id, split(markets,';') AS markets 
 FROM VALUES (1,'new york'), (1,'new york;chicargo') s(submitter_id, markets)
)
SELECT 
  a.submitter_id,
  ARRAY_AGG(DISTINCT a.market)
FROM (
    SELECT s.submitter_id
        ,f.value AS market
    FROM data AS s,
    LATERAL FLATTEN(input => s.markets) f
) AS a
GROUP BY 1;
A variation using a javascript UDF:
WITH data AS (
 SELECT submitter_id, split(markets,';') AS markets 
 FROM VALUES (1,'new york'), (1,'new york;chicargo') s(submitter_id, markets)
)
SELECT 
  submitter_id,
  array_flat_distinct(ARRAY_AGG(distinct markets))
from data
group by 1;
Where the UDF is defined as:
create or replace function array_flat_distinct("a" array)
returns array
language javascript
as
$$
    return [...new Set(a.reduce((b,c)=>[...b,...c]))]
$$
;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With