We have just started migrating our queries from Legacy SQL to Standard SQL, so we are still learning how to process nested data and arrays.
Basically, what we want to do is retrieve the following data from the ga_sessions table:
visitor id, session id, array of skus
visitor 1, session 1, [sku_0, sku_1, (...), sku_n]
visitor 1, session 2, [skus]
To do so we ran this simple query:
WITH
customers_data AS(
SELECT
fullvisitorid fv,
visitid v,
ARRAY_AGG((
SELECT
prods.productsku
FROM
UNNEST(hits.product) prods)) sku
FROM
`dataset_id.ga_sessions_*`,
UNNEST(hits) hits
WHERE
1 = 1
AND _table_suffix BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 0 DAY))
--and (select count(productsku) from unnest(hits.product) where productsku is not null) = 1
GROUP BY
fv,
v
LIMIT
100 )
SELECT
*
FROM
customers_data
But we get this error:
Error: Scalar subquery produced more than one element
The data that comes from the hits field looks something like this:
When we added back the WHERE clause:
and (select count(productsku) from unnest(hits.product) where productsku is not null) = 1
the query no longer raises an error, but the results contain duplicated SKUs and we also lose the SKUs inside the bigger arrays.
Is there some mistake in our query that prevents the arrays from being unnested?
A scalar subquery can be used anywhere in an SQL query that a column or expression can be used. FROM emp, (SELECT dept_name FROM dept WHERE dept = 'finance') dept1; Scalar subqueries can also be used for inserting into tables, based on values from other tables.
To convert an ARRAY into a set of rows, also known as "flattening," use the UNNEST operator. UNNEST takes an ARRAY and returns a table with a single row for each element in the ARRAY . Because UNNEST destroys the order of the ARRAY elements, you may wish to restore order to the table.
When the subquery is written with SELECT AS STRUCT , the SELECT list can include multiple columns, and the value returned by the array subquery is an ARRAY of the constructed STRUCTs. Selecting multiple columns without using SELECT AS is an error. ARRAY subqueries can use SELECT AS STRUCT to build arrays of structs.
If I understand correctly, I think you want something like this:
WITH customers_data AS (
SELECT
fullvisitorid fv,
visitid v,
ARRAY_CONCAT_AGG(ARRAY(
SELECT productsku FROM UNNEST(hits.product))) sku
FROM
`dataset_id.ga_sessions_*`,
UNNEST(hits) hits
WHERE
_table_suffix BETWEEN
FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 0 DAY))
GROUP BY
fv,
v
LIMIT
100
)
SELECT
*
FROM
customers_data;
This preserves all of the SKUs through the use of ARRAY_CONCAT_AGG over an ARRAY subquery that extracts the SKUs for each row. If you want to deduplicate all of the SKUs across rows, you can replace
SELECT
*
FROM
customers_data;
with:
SELECT *
REPLACE (ARRAY(SELECT DISTINCT s FROM UNNEST(sku) AS s) AS sku)
FROM
customers_data;
Edit: For more reading, take a look at types of expression subqueries in the documentation. In your case, you needed an ARRAY subquery, since the idea was to take an ARRAY<STRUCT<...>> in each row and transform it into an ARRAY of the field type in order to concatenate the arrays across rows.
ARRAY_AGG creates an array from individual elements, whereas ARRAY_CONCAT_AGG creates an array from the concatenation of arrays. The difference between them is similar to the difference between the array literal constructor [] and ARRAY_CONCAT, except that the _AGG versions are aggregate functions.
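To make that comparison concrete, here is a minimal sketch that puts the four constructs side by side. The table T and its columns x and arr are made up for illustration only; element order in the aggregated arrays is not guaranteed unless you add an ORDER BY inside the aggregate.

```sql
-- Non-aggregate forms: [] builds an array from elements,
-- ARRAY_CONCAT joins existing arrays end to end.
SELECT
  [1, 2, 3] AS from_elements,
  ARRAY_CONCAT([1, 2], [3]) AS from_arrays;

-- Aggregate forms: the same two ideas, applied across rows.
-- T is a hypothetical table used only for this illustration.
WITH T AS (
  SELECT 1 AS x, [1, 2] AS arr UNION ALL
  SELECT 2, [3]
)
SELECT
  ARRAY_AGG(x) AS agg_elements,        -- one element contributed per row
  ARRAY_CONCAT_AGG(arr) AS agg_arrays  -- one whole array contributed per row
FROM T;
```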
As a standalone example, you can try:
WITH T AS (
SELECT ARRAY<STRUCT<x INT64, y INT64>>[(1, 10), (2, 11), (3, 12)] AS arr UNION ALL
SELECT ARRAY<STRUCT<x INT64, y INT64>>[(4, 13)] UNION ALL
SELECT ARRAY<STRUCT<x INT64, y INT64>>[(5, 14), (6, 15)]
)
SELECT ARRAY(SELECT x FROM UNNEST(arr)) AS x_array
FROM T;
This returns a column x_array where the elements in each array are those of the x field from each element in arr. To concatenate all of the arrays so that there is a single row in the result, use ARRAY_CONCAT_AGG, e.g.:
WITH T AS (
SELECT ARRAY<STRUCT<x INT64, y INT64>>[(1, 10), (2, 11), (3, 12)] AS arr UNION ALL
SELECT ARRAY<STRUCT<x INT64, y INT64>>[(4, 13)] UNION ALL
SELECT ARRAY<STRUCT<x INT64, y INT64>>[(5, 14), (6, 15)]
)
SELECT ARRAY_CONCAT_AGG(ARRAY(SELECT x FROM UNNEST(arr))) AS x_array
FROM T;
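The same toy data also helps explain the original error: a subquery wrapped only in parentheses, as in ARRAY_AGG((SELECT x FROM UNNEST(arr))), is a scalar subquery and may return at most one row, so it fails for any arr with more than one element. As a sketch of one possible alternative (not the approach used above), you could flatten arr with a join and let ARRAY_AGG collect the individual values. Keep in mind that the comma join drops rows whose arr is empty, and that ARRAY_AGG raises an error if the resulting array would contain a NULL unless you add IGNORE NULLS.

```sql
WITH T AS (
  SELECT ARRAY<STRUCT<x INT64, y INT64>>[(1, 10), (2, 11), (3, 12)] AS arr UNION ALL
  SELECT ARRAY<STRUCT<x INT64, y INT64>>[(4, 13)] UNION ALL
  SELECT ARRAY<STRUCT<x INT64, y INT64>>[(5, 14), (6, 15)]
)
-- The join produces one row per element of arr (aliased e here),
-- and ARRAY_AGG then gathers the x values back into a single array.
SELECT ARRAY_AGG(e.x) AS x_array
FROM T, UNNEST(arr) AS e;
```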
For your other question, REPLACE accepts a list of expressions paired with the columns that they are meant to replace. The expression can be something simple such as a literal, or it can be something more complicated such as an ARRAY subquery, which is what I used. For example:
WITH T AS (
SELECT 1 AS x, 'foo' AS y, true AS z UNION ALL
SELECT 2, 'bar', false UNION ALL
SELECT 3, 'baz', true
)
SELECT * REPLACE(1 - x AS x, CAST(x AS STRING) AS y)
FROM T;
This replaces the original x and y columns that would have been returned from the SELECT * with the results of 1 - x and CAST(x AS STRING) instead.