I have a table in BigQuery with an ID field and a repeated record field, along with some other fields such as a data collection time.
There are multiple rows in this table for each ID, and I want to select or merge them into a single row per ID. Almost any solution is acceptable for my use case (selecting the last row, selecting the first row, or aggregating the rows with the same ID into one row), but I haven't been able to get any of them to work so far.
To be more precise, my table has an ID field, which in BigQuery terms is:
{name: ID, type: STRING, mode: NULLABLE}
and a repeated field: {name: data, type: RECORD, mode: REPEATED}
along with some other (plain) fields.
In my table there are multiple rows for each ID, and each of them has its own repeated data field. In my query result, I want a table with exactly the same schema in which each ID appears only once and its data field is the data field from one of the occurrences of that ID in the original table (or, ideally, the union of the data fields from all of its occurrences).
Here are the approaches I tried that don't work:
First: using row_number() OVER (PARTITION BY ID ORDER BY collection_time) as rn ... where rn=1
Cause: BigQuery flattens the results when PARTITION BY is used, even if the Unflatten Results option is selected.
Second: selecting the row with the max/min collection time value.
Cause: the value of that column is not unique for each ID in my table, due to some duplication in other parts of the system.
Third: using GROUP BY ID with nest/first on the other fields.
Cause: using nest on the repeated record destroys the pairing between the fields inside each record. For example, SELECT ID, nest(data.a), nest(data.b) from:
ID   data.a   data.b
--------------------------
1    1a1      null
     1a2      1b2
--------------------------
1    2a1      2b1
     null     2b2
results in:
ID   data.a   data.b
----------------------------
1    1a1      1b2
     1a2      2b1
     2a1      2b2
You'll have an easier time solving this using standard SQL (uncheck "Use Legacy SQL" under "Show Options"). You would use GROUP BY with ARRAY_CONCAT_AGG, e.g.:
SELECT id, ARRAY_CONCAT_AGG(data) AS data
FROM MyTable
GROUP BY id;
Try the query below in Standard SQL mode:
SELECT id, ARRAY_AGG(STRUCT(a, b)) AS data
FROM (
  -- non-NULL a values, numbered within each id
  SELECT id, a, ROW_NUMBER() OVER(PARTITION BY id) AS num
  FROM YourTable, UNNEST(data) WHERE a IS NOT NULL
) FULL OUTER JOIN (
  -- non-NULL b values, numbered within each id
  SELECT id, b, ROW_NUMBER() OVER(PARTITION BY id) AS num
  FROM YourTable, UNNEST(data) WHERE b IS NOT NULL
)
USING(id, num)
GROUP BY id
This gives exactly the result shown in your question (with NULLs eliminated):
ID   data.a   data.b
----------------------------
1    1a1      1b2
     1a2      2b1
     2a1      2b2
If, on the other hand, you want to preserve the original a/b pairs, use the query below (still in Standard SQL mode):
SELECT id, ARRAY_CONCAT_AGG(data) AS data
FROM YourTable
GROUP BY id
This gives the following result:
ID   data.a   data.b
----------------------------
1    1a1      null
     1a2      1b2
     2a1      2b1
     null     2b2
You can test both queries either by running them against your actual table (replace YourTable with your actual table, e.g. `project.dataset.table`) or by prepending the code below to the respective query and running it as is:
WITH YourTable AS (
SELECT 1 AS id, ARRAY<STRUCT<a STRING, b STRING>>[('1a1', NULL),('1a2','1b2')] AS data UNION ALL
SELECT 1 AS id, ARRAY<STRUCT<a STRING, b STRING>>[('2a1', '2b1'),(NULL,'2b2')] AS data
)
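Since the question also says that keeping just the first or last row per ID would be acceptable, here is a minimal sketch of that variant in Standard SQL. It aggregates whole rows with ARRAY_AGG, orders them by collection time, and keeps only the first element. The collection_time column and its sample values are assumptions added to the test data for illustration (the question only mentions that such a field exists), and when two rows share the same collection_time, which one is kept is arbitrary.

```sql
-- Sketch: keep one whole row per id (here, the most recent one).
-- collection_time and its values are hypothetical additions to the sample data.
WITH YourTable AS (
  SELECT 1 AS id, TIMESTAMP '2017-01-01 00:00:00' AS collection_time,
    ARRAY<STRUCT<a STRING, b STRING>>[('1a1', NULL), ('1a2', '1b2')] AS data UNION ALL
  SELECT 1 AS id, TIMESTAMP '2017-01-02 00:00:00' AS collection_time,
    ARRAY<STRUCT<a STRING, b STRING>>[('2a1', '2b1'), (NULL, '2b2')] AS data
)
SELECT latest.*
FROM (
  SELECT
    -- Collect each row as a struct, newest first, and keep only one per id.
    ARRAY_AGG(t ORDER BY collection_time DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM YourTable AS t
  GROUP BY id
);
```

Because the row is carried through as a struct and expanded again with latest.*, the repeated data field keeps its original a/b pairs and the output has the same columns as the input. Switching DESC to ASC would keep the earliest row instead.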