Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BigQuery standard SQL: how to group by an ARRAY field

My table has two columns, id and a. Column id contains a number, column a contains an array of strings. I want to count the number of unique id for a given array, equality between arrays being defined as "same size, same string for each index".

When using GROUP BY a, I get Grouping by expressions of type ARRAY is not allowed. I can use something like GROUP BY ARRAY_TO_STRING(a, ","), but then the two arrays ["a,b"] and ["a","b"] are grouped together, and I lose the "real" value of my array (so if I want to use it later in another query, I have to split the string).

The values in this field array come from the user, so I can't assume that some character is simply never going to be there (and use it as a separator).

like image 663
Ted Avatar asked Feb 23 '18 00:02

Ted


People also ask

How to use arrays in BigQuery?

With BigQuery, you can construct array literals, build arrays from subqueries using the ARRAY function, and aggregate values into an array using the ARRAY_AGG function. You can combine arrays using functions like ARRAY_CONCAT() , and convert arrays to strings using ARRAY_TO_STRING() .

What is array AGG in BigQuery?

ARRAY_AGG. Returns an ARRAY of expression values. To learn more about the optional arguments in this function and how to use them, see Aggregate function calls. To learn more about the OVER clause and how to use it, see Window function calls.

How do I cast an array to a string in SQL?

In order to convert array to a string, Spark SQL provides a built-in function concat_ws() which takes delimiter of your choice as a first argument and array column (type Column) as the second argument. In order to use concat_ws() function, you need to import it using org. apache. spark.

Can we use array in SQL query?

The ARRAY function returns an ARRAY with one element for each row in a subquery. If subquery produces a SQL table, the table must have exactly one column. Each element in the output ARRAY is the value of the single column of a row in the table.


1 Answers

Instead of GROUP BY ARRAY_TO_STRING(a, ",") use GROUP BY TO_JSON_STRING(a)

so your query will look like below

#standardsql
SELECT 
  TO_JSON_STRING(a) arr,
  COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY arr

You can test it with dummy data like below

#standardsql
WITH `project.dataset.table` AS (
  SELECT 1 id, ["a,b", "c"] a UNION ALL
  SELECT 1, ["a","b,c"]
)
SELECT 
  TO_JSON_STRING(a) arr,
  COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY arr  

with result as

Row     arr             cnt  
1       ["a,b","c"]     1    
2       ["a","b,c"]     1    

Update based on @Ted's comment

#standardsql
SELECT 
  ANY_VALUE(a) a,
  COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY TO_JSON_STRING(a)
like image 156
Mikhail Berlyant Avatar answered Oct 01 '22 02:10

Mikhail Berlyant