I am having trouble joining on a repeated nested field while still preserving the original row structure in BigQuery.
For my example I'll call the two tables being joined A
and B
.
Records in table A look something like:
{
"url":"some url",
"repeated_nested": [
{"key":"some key","property":"some property"}
]
}
and records in table B look something like:
{
"key":"some key",
"property2": "another property"
}
I am hoping to find a way to join this data together to generate a row that looks like:
{
"url":"some url",
"repeated_nested": [
{
"key":"some key",
"property":"some property",
"property2":"another property"
}
]
}
The very first query I tried was:
SELECT
url, repeated_nested.key, repeated_nested.property, repeated_nested.property2
FROM A
AS lefttable
LEFT OUTER JOIN B
AS righttable
ON lefttable.key=righttable.key
This doesn't work because BQ can't join on repeated nested fields. There is not a unique identifier for each row. If I were to do a FLATTEN
on repeated_nested
then I'm not sure how to get the original row put back together correctly.
The data is such that a url
will always have the same repeated_nested
field with it. Because of that, I was able to make a workaround using a UDF to sort of roll up this repeated nested object into a JSON string and then unroll it again:
SELECT url, repeated_nested.key, repeated_nested.property, repeated_nested.property2
FROM
JS(
(
SELECT basetable.url as url, repeated_nested
FROM A as basetable
LEFT JOIN (
SELECT url, CONCAT("[", GROUP_CONCAT_UNQUOTED(repeated_nested_json, ","), "]") as repeated_nested
FROM
(
SELECT
url,
CONCAT(
'{"key": "', repeated_nested.key, '",',
' "property": "', repeated_nested.property, '",',
' "property2": "', mapping_table.property2, '"',
'}'
)
) as repeated_nested_json
FROM (
SELECT
url, repeated_nested.key, repeated_nested.property
FROM A
GROUP BY url, repeated_nested.key, repeated_nested.property
) as urltable
LEFT OUTER JOIN [SDF.alchemy_to_ric]
AS mapping_table
ON urltable.repeated_nested.key=mapping_table.key
)
GROUP BY url
) as companytable
ON basetable.url = urltable.url
),
// input columns:
url, repeated_nested_json,
// output schema:
"[{'name': 'url', 'type': 'string'},
{'name': 'repeated_nested_json', 'type': 'RECORD', 'mode':'REPEATED', 'fields':
[ { 'name': 'key', 'type':'string' },
{ 'name': 'property', 'type':'string' },
{ 'name': 'property2', 'type':'string' }]
}]",
// UDF:
"function(row, emit) {
parsed_repeated_nested = [];
try {
if ( row.repeated_nested_json != null ) {
parsed_repeated_nested = JSON.parse(row.repeated_nested_json);
}
} catch (ex) { }
emit({
url: row.url,
repeated_nested: parsed_repeated_nested
});
}"
)
This solution works fine for small tables. But the real life tables I'm working with have many more columns than in my example above. When there are other fields in addition to url
and repeated_nested_json
they all have to be passed through the UDF. When I work with tables that are around the 50 gb range everything is fine. But when I apply the UDF and query to tables that are 500-1000 gb, I get an Internal Server Error from BQ.
In the end I just need all of the data in new line delimited JSON format in GCS. As a last ditch effort I tried concatenating all of the fields into a JSON string (so that I only had 1 column) in the hopes that I could export it as CSV and have what I need. However, the export process escaped the double quotes and adds double quotes around the JSON string. According to the BQ docs on jobs (https://cloud.google.com/bigquery/docs/reference/v2/jobs) there is a property configuration.query.tableDefinitions.(key).csvOptions.quote
that could help me. But I can't figure out how to make it work.
Does anybody have advice on how they have dealt with this sort of situation?
Use INT64 data types in joins to reduce cost and improve comparison performance. Best practice: If your use case supports it, use INT64 data types in joins instead of STRING data types. BigQuery does not index primary keys like traditional databases, so the wider the join column is, the longer the comparison takes.
How to Query BigQuery Repeated Fields. To extract information from a repeated field in BigQuery, you must use a more exotic pattern. This is normally done using the UNNEST function, which converts an array of values in a table into rows. These can then be joined to the original table to be queried.
A repeated field can be accessed as an ARRAY type in Google Standard SQL. A RECORD column can have REPEATED mode, which is represented as an array of STRUCT types. Also, a field within a record can be repeated, which is represented as a STRUCT that contains an ARRAY . An array cannot contain another array directly.
BigQuery automatically flattens nested fields when querying. To query a column with nested data, each field must be identified in the context of the column that contains it. For example: customer.id refers to the id field in the customer column.
I have never had to do this, but you should be able to use flatten
, then join, then use nest
to get repeated fields again.
The docs state that BigQuery always flattens query results, but that appears to be false: you can choose to not have results flattened if you set a destination table. You should then be able to export that table as JSON to Storage.
See also this answer for how to get nest
to work.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With