Select only one row for each "ID" in a nested repeated schema in BigQuery

Question

I have a table in BigQuery with an ID field and a repeated record field along with some other fields like a data collection time.

There are multiple rows in this table for each ID and I want to somehow select/merge a single row for each ID. Almost every solution like selecting last, selecting first and aggregating rows with duplicate ID in one row are acceptable in my use case but I wasn't able to get any of them to work so far.

To be more precise my table has an ID field which in BigQuery terms is: {name: ID, type: STRING, mode: NULLABLE} and repeated field: {name: data, type: RECORD, mode: REPEATED} along with some other (plain) fields. In my table there are multiple rows for each ID that each one has a repeated field data for itself. In my query result, I want to have a table with exact same schema in which each ID appears only once and its corresponding data field is data field from one of occurrences of ID in the original table. (or ideally union from all its occurances)

Here is a list of solutions that don't work here:

First: Using

row_number() OVER (PARTITION BY ID ORDER BY collection_time) as rn ... where  rn=1

Cause: BigQuery flattens results when using partition by even if it Unflatten Results option is used.

Second: Selecting row with max/min collection time value:

Because: the value of the column is not unique for each id in my table due to some duplication in other parts of system.

Third: Using group by ID with nest/first on other fields.

Cause: using nest on the repeated record destroys the relation in the record field. For example SELECT ID, nest(data.a), nest(data.b) from:

ID     data.a      data.b
--------------------------
1      1a1          null
       1a2          1b2
--------------------------
1      2a1          2b1
       null         2b2

results in

ID      data.a       data.b
----------------------------
1        1a1         1b2
         1a2         2b1
         2a1         2b2

Elliott Brossard · Accepted Answer

You'll have an easier time solving this using standard SQL (uncheck "Use Legacy SQL" under "Show Options"). You would use GROUP BY with ARRAY_CONCAT_AGG, e.g.:

SELECT id, ARRAY_CONCAT_AGG(data) AS data
FROM MyTable
GROUP BY id;

Mikhail Berlyant · Answer

Try below in Standard SQL mode

SELECT id, ARRAY_AGG(STRUCT(a, b)) AS data
FROM (
  SELECT id, a, ROW_NUMBER() OVER() AS num 
  FROM YourTable, UNNEST(data) WHERE NOT a IS NULL 
) FULL OUTER JOIN (
  SELECT id, b, ROW_NUMBER() OVER() AS num 
  FROM YourTable, UNNEST(data) WHERE NOT b IS NULL 
)  
USING(id, num) 
GROUP BY id

it gives you exactly result you expect in your question (with NULLs being eliminated):

ID      data.a       data.b
----------------------------
1        1a1         1b2
         1a2         2b1
         2a1         2b2

If (on the other hand) you would wanted to preserve original a/b pairs - you should use below (still in Standard SQL mode)

SELECT id, ARRAY_CONCAT_AGG(data) AS data
FROM YourTable
GROUP BY id

This gives you below result

ID      data.a       data.b
----------------------------
1        1a1         null
         1a2         1b2
         2a1         2b1
         null        2b2

You can test both query either by running them against your actual table (change YourTable to your actual table -> `project.dataset.table`) or by prepending respective query with below code and running as is

WITH YourTable AS (
  SELECT 1 AS id, ARRAY<STRUCT<a STRING, b STRING>>[('1a1', NULL),('1a2','1b2')] AS data UNION ALL
  SELECT 1 AS id, ARRAY<STRUCT<a STRING, b STRING>>[('2a1', '2b1'),(NULL,'2b2')] AS data 
)

Select only one row for each "ID" in a nested repeated schema in BigQuery

Tags:

sql

google-bigquery

S.Mohsen sh

2 Answers

Elliott Brossard

Mikhail Berlyant

Recent Activity

Donate For Us

Select only one row for each "ID" in a nested repeated schema in BigQuery

Tags:

sql

google-bigquery

S.Mohsen sh

2 Answers

Elliott Brossard

Mikhail Berlyant

Related questions

Recent Activity

Donate For Us