BigQuery joins on nested repeated fields

Tags:

google-bigquery

I am having trouble joining on a repeated nested field while still preserving the original row structure in BigQuery.

For my example I'll call the two tables being joined A and B.

Records in table A look something like:

{
  "url":"some url",
  "repeated_nested": [
    {"key":"some key","property":"some property"}
  ]
}

and records in table B look something like:

{
  "key":"some key",
  "property2": "another property"
}

I am hoping to find a way to join this data together to generate a row that looks like:

{
  "url":"some url",
  "repeated_nested": [
    {
      "key":"some key",
      "property":"some property",
      "property2":"another property"
    }
  ]
}

The very first query I tried was:

SELECT 
  url, repeated_nested.key, repeated_nested.property, repeated_nested.property2
FROM A
  AS lefttable
LEFT OUTER JOIN B
  AS righttable
  ON lefttable.key=righttable.key

This doesn't work because BQ can't join on repeated nested fields. There is not a unique identifier for each row. If I were to do a FLATTEN on repeated_nested then I'm not sure how to get the original row put back together correctly.

The data is such that a url will always have the same repeated_nested field with it. Because of that, I was able to make a workaround using a UDF to sort of roll up this repeated nested object into a JSON string and then unroll it again:

SELECT url, repeated_nested.key, repeated_nested.property, repeated_nested.property2

FROM
JS(
   (
    SELECT basetable.url as url, companytable.repeated_nested_json as repeated_nested_json
    FROM A as basetable

    LEFT JOIN (
      SELECT url, CONCAT("[", GROUP_CONCAT_UNQUOTED(repeated_nested_json, ","), "]") as repeated_nested_json
      FROM
      (
        SELECT
          url,
          CONCAT(
            '{"key": "', repeated_nested.key, '",',
            ' "property": "', repeated_nested.property, '",',
            ' "property2": "', mapping_table.property2, '"',
            '}'
          ) as repeated_nested_json
        FROM (
          SELECT
            url, repeated_nested.key, repeated_nested.property
          FROM A
          GROUP BY url, repeated_nested.key, repeated_nested.property
        ) as urltable

         LEFT OUTER JOIN [SDF.alchemy_to_ric]
            AS mapping_table
            ON urltable.repeated_nested.key=mapping_table.key
      )
      GROUP BY url
    ) as companytable
    ON basetable.url = companytable.url
  ),

  // input columns:
  url, repeated_nested_json,

  // output schema:
  "[{'name': 'url', 'type': 'string'},
   {'name': 'repeated_nested', 'type': 'RECORD', 'mode':'REPEATED', 'fields':
    [ { 'name': 'key', 'type':'string' },
      { 'name': 'property', 'type':'string' },
      { 'name': 'property2', 'type':'string' }]
   }]",

  // UDF:
  "function(row, emit) {
    parsed_repeated_nested = [];
    try {
      if ( row.repeated_nested_json != null ) {
          parsed_repeated_nested = JSON.parse(row.repeated_nested_json);
      }
    } catch (ex) { }

    emit({
      url: row.url,
      repeated_nested: parsed_repeated_nested
    });
  }"
)
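One thing that may be worth checking in the UDF above: the empty catch block means any row whose concatenated JSON fails to parse (for example, when a value contains a double quote) silently comes back with an empty repeated_nested. The following is only a sketch of a standalone version of the UDF body that can be exercised outside BigQuery. The function name rebuildRow and the extra parse_error field are hypothetical; using parse_error inside JS() would also require adding a nullable string field to the output schema.

```javascript
// Hypothetical standalone version of the UDF body, runnable in plain Node.js.
function rebuildRow(row, emit) {
  // Declared locally; the original assigned to an implicit global variable.
  let parsedRepeatedNested = [];
  // Assumption: surface parse failures instead of swallowing them.
  let parseError = null;

  if (row.repeated_nested_json != null) {
    try {
      parsedRepeatedNested = JSON.parse(row.repeated_nested_json);
    } catch (ex) {
      parseError = ex.message;
    }
  }

  emit({
    url: row.url,
    repeated_nested: parsedRepeatedNested,
    parse_error: parseError
  });
}

// Usage: collect emitted rows in an array instead of handing them to BigQuery.
const emittedRows = [];
const collect = (r) => emittedRows.push(r);

// A well-formed JSON string, as the CONCAT in the query would normally produce.
rebuildRow({
  url: 'some url',
  repeated_nested_json: '[{"key": "some key", "property": "some property", "property2": "another property"}]'
}, collect);

// A value containing an unescaped double quote breaks the hand-built JSON.
rebuildRow({
  url: 'other url',
  repeated_nested_json: '[{"key": "say "hi"", "property": "p", "property2": "q"}]'
}, collect);

console.log(JSON.stringify(emittedRows, null, 2));
```

Running this locally against a sample of real rows is one way to find out whether string escaping, rather than table size alone, contributes to the failures.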

This solution works fine for small tables, but the real tables I'm working with have many more columns than the example above. Every field other than url and repeated_nested_json also has to be passed through the UDF. With tables of around 50 GB everything is fine, but when I apply the UDF and query to tables of 500-1000 GB, I get an Internal Server Error from BQ.

In the end I just need all of the data as newline-delimited JSON in GCS. As a last-ditch effort I tried concatenating all of the fields into a single JSON string (so that I only had one column), hoping I could export it as CSV and get what I need. However, the export process escapes the double quotes inside the JSON string and wraps the whole string in double quotes. According to the BQ docs on jobs (https://cloud.google.com/bigquery/docs/reference/v2/jobs) there is a property configuration.query.tableDefinitions.(key).csvOptions.quote that could help me, but I can't figure out how to make it work.
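As far as I can tell, tableDefinitions describes external data sources that a query reads from, so its csvOptions.quote setting would not change how an export is written. If the single-column CSV export is kept, one possible workaround is to undo the quoting after the fact. The sketch below assumes standard CSV quoting (the field wrapped in double quotes, inner double quotes doubled) and one JSON document per line with no embedded newlines; the helper name csvLineToJson is hypothetical.

```javascript
// Hypothetical helper: undo CSV quoting on a line that holds a single JSON column.
// Assumes the field contains no embedded newlines.
function csvLineToJson(line) {
  const isQuoted = line.length >= 2 && line.startsWith('"') && line.endsWith('"');
  if (!isQuoted) {
    return line; // the field was not quoted, so there is nothing to undo
  }
  // Strip the wrapping quotes, then collapse each doubled quote back to one.
  return line.slice(1, -1).replace(/""/g, '"');
}

// Usage: a line as it might appear in the exported CSV file.
const exportedLine = '"{""url"":""some url"",""repeated_nested"":[{""key"":""some key""}]}"';
const jsonLine = csvLineToJson(exportedLine);
console.log(jsonLine);
console.log(JSON.parse(jsonLine).repeated_nested[0].key);
```

Applied line by line to the exported file, this would give newline-delimited JSON, at the cost of an extra processing step outside BigQuery.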

Does anybody have advice on how they have dealt with this sort of situation?


asked Mar 11 '16 00:03

Andrew Backes

1 Answer

I have never had to do this, but you should be able to use FLATTEN, then join, then use NEST to get the repeated fields back.
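To make the three steps concrete, here is a minimal in-memory sketch of the flatten, join, nest sequence in plain JavaScript, using the sample records from the question. It only illustrates the data flow and is not BigQuery code. The function names are hypothetical, and it assumes that url identifies a row of A and that key is unique in B.

```javascript
// In-memory stand-ins for tables A and B (sample values from the question).
const tableA = [
  { url: 'some url', repeated_nested: [{ key: 'some key', property: 'some property' }] }
];
const tableB = [
  { key: 'some key', property2: 'another property' }
];

// Step 1: one output row per repeated element.
// Note: in this sketch, rows with an empty repeated_nested array disappear.
function flatten(rows) {
  return rows.flatMap((row) =>
    row.repeated_nested.map((item) => ({
      url: row.url,
      key: item.key,
      property: item.property
    }))
  );
}

// Step 2: left join on key; unmatched rows get property2 = null.
function leftJoin(flatRows, lookupRows) {
  const byKey = new Map(lookupRows.map((r) => [r.key, r]));
  return flatRows.map((r) => ({
    ...r,
    property2: byKey.has(r.key) ? byKey.get(r.key).property2 : null
  }));
}

// Step 3: regroup by url so each url carries its repeated records again.
function nest(joinedRows) {
  const byUrl = new Map();
  for (const { url, ...item } of joinedRows) {
    if (!byUrl.has(url)) {
      byUrl.set(url, []);
    }
    byUrl.get(url).push(item);
  }
  return Array.from(byUrl, ([url, repeated_nested]) => ({ url, repeated_nested }));
}

const rebuilt = nest(leftJoin(flatten(tableA), tableB));
console.log(JSON.stringify(rebuilt, null, 2));
```

The regrouping step relies on url as the grouping key, which matches the question's statement that a url always has the same repeated_nested field.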

The docs state that BigQuery always flattens query results, but that appears to be false: you can choose to not have results flattened if you set a destination table. You should then be able to export that table as JSON to Storage.
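A rough sketch of what the two job configurations might look like, written as plain JavaScript objects that follow the v2 jobs resource. The field names (destinationTable, allowLargeResults, flattenResults, destinationFormat, destinationUris) are taken from that resource as I understand it; the helper name buildJobs and all project, dataset, table, and bucket values are placeholders.

```javascript
// Hypothetical helper that builds the request bodies for the two jobs.
// It only constructs objects; submitting them to the API is left out.
function buildJobs(projectId, datasetId, tableId, query, gcsUri) {
  const table = { projectId, datasetId, tableId };

  // Query job: write to a destination table and keep nested/repeated structure.
  const queryJob = {
    configuration: {
      query: {
        query,
        destinationTable: table,
        allowLargeResults: true, // needed when flattenResults is false
        flattenResults: false
      }
    }
  };

  // Extract job: export that table to GCS as newline-delimited JSON.
  const extractJob = {
    configuration: {
      extract: {
        sourceTable: table,
        destinationUris: [gcsUri],
        destinationFormat: 'NEWLINE_DELIMITED_JSON'
      }
    }
  };

  return { queryJob, extractJob };
}

// Usage with placeholder names.
const jobs = buildJobs(
  'my-project',
  'my_dataset',
  'joined_table',
  'SELECT ...', // the flatten/join/nest query goes here
  'gs://my-bucket/joined-*.json'
);
console.log(JSON.stringify(jobs, null, 2));
```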

See also this answer for how to get nest to work.


answered Sep 21 '22 15:09

oulenz
