BigQuery joins on nested repeated fields

I am having trouble joining on a repeated nested field while still preserving the original row structure in BigQuery.

For my example I'll call the two tables being joined A and B.

Records in table A look something like:

{
  "url":"some url",
  "repeated_nested": [
    {"key":"some key","property":"some property"}
  ]
}

and records in table B look something like:

{
  "key":"some key",
  "property2": "another property"
}

I am hoping to find a way to join this data together to generate a row that looks like:

{
  "url":"some url",
  "repeated_nested": [
    {
      "key":"some key",
      "property":"some property",
      "property2":"another property"
    }
  ]
}

The very first query I tried was:

SELECT
  url, repeated_nested.key, repeated_nested.property, repeated_nested.property2
FROM A
  AS lefttable
JOIN B
  AS righttable
  ON lefttable.key=righttable.key

This doesn't work because BQ can't join on repeated nested fields. There is no unique identifier for each row, so if I were to FLATTEN repeated_nested I'm not sure how I would put the original rows back together correctly.

The data is such that a url will always have the same repeated_nested field with it. Because of that, I was able to make a workaround using a UDF to sort of roll up this repeated nested object into a JSON string and then unroll it again:

SELECT url, repeated_nested.key, repeated_nested.property, repeated_nested.property2
FROM JS(
  (
    SELECT basetable.url as url, repeated_nested
    FROM A as basetable
    LEFT OUTER JOIN (
      SELECT url, CONCAT("[", GROUP_CONCAT_UNQUOTED(repeated_nested_json, ","), "]") as repeated_nested
      FROM (
        SELECT
          url,
          CONCAT(
              '{"key": "', repeated_nested.key, '",',
              ' "property": "', repeated_nested.property, '",',
              ' "property2": "', mapping_table.property2, '"',
              '}'
          ) as repeated_nested_json
        FROM (
          SELECT
            url, repeated_nested.key, repeated_nested.property
          FROM A
          GROUP BY url, repeated_nested.key, repeated_nested.property
        ) as urltable

         LEFT OUTER JOIN [SDF.alchemy_to_ric]
            AS mapping_table
            ON urltable.repeated_nested.key=mapping_table.key
      )
      GROUP BY url
    ) as companytable
    ON basetable.url = urltable.url
  ),

  // input columns:
  url, repeated_nested_json,

  // output schema:
  "[{'name': 'url', 'type': 'string'},
   {'name': 'repeated_nested_json', 'type': 'RECORD', 'mode':'REPEATED', 'fields':
    [ { 'name': 'key', 'type':'string' },
      { 'name': 'property', 'type':'string' },
      { 'name': 'property2', 'type':'string' }]
   }]",

  // UDF:
  "function(row, emit) {
    parsed_repeated_nested = [];
    try {
      if ( row.repeated_nested_json != null ) {
          parsed_repeated_nested = JSON.parse(row.repeated_nested_json);
      }
    } catch (ex) { }

    emit({
      url: row.url,
      repeated_nested: parsed_repeated_nested
    });
  }"
)

This solution works fine for small tables, but the real-life tables I'm working with have many more columns than in my example above. When there are other fields in addition to url and repeated_nested_json, they all have to be passed through the UDF. With tables in the 50 GB range everything is fine, but when I apply the UDF and query to tables that are 500-1000 GB, I get an Internal Server Error from BQ.
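For anyone trying to reproduce this, the UDF body can be exercised outside BigQuery to rule out problems in the JavaScript itself. The following is only a sketch, assuming Node.js: `unrollNested` is the UDF from the query written as a plain function, and `runUdf` is a hypothetical local harness that stands in for BigQuery's `emit` callback. It is not part of any BigQuery API.

```javascript
// The UDF body from the question, written as a plain function.
function unrollNested(row, emit) {
  var parsed = [];
  try {
    if (row.repeated_nested_json != null) {
      parsed = JSON.parse(row.repeated_nested_json);
    }
  } catch (ex) {
    // Malformed JSON falls back to an empty array, as in the original UDF.
  }
  emit({ url: row.url, repeated_nested: parsed });
}

// Hypothetical local harness: collects whatever the UDF emits for each row.
function runUdf(udf, rows) {
  var out = [];
  rows.forEach(function (row) {
    udf(row, function (emitted) { out.push(emitted); });
  });
  return out;
}

// Sample rows (made up for illustration): one valid, one with an
// unescaped quote inside a value, one with a NULL JSON column.
var sampleRows = [
  { url: "some url",
    repeated_nested_json: '[{"key":"some key","property":"some property","property2":"another property"}]' },
  { url: "bad url",
    repeated_nested_json: '[{"key": "unescaped " quote"}]' },
  { url: "empty url", repeated_nested_json: null }
];

var results = runUdf(unrollNested, sampleRows);
console.log(JSON.stringify(results, null, 2));
```

One thing this makes visible: because the JSON is assembled with CONCAT, any value that contains a double quote yields invalid JSON, and the empty catch block silently turns that row's repeated_nested into an empty array.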

In the end I just need all of the data in newline-delimited JSON format in GCS. As a last-ditch effort I tried concatenating all of the fields into a JSON string (so that I only had one column) in the hope that I could export it as CSV and have what I need. However, the export process escapes the double quotes and adds double quotes around the JSON string. According to the BQ docs on jobs (https://cloud.google.com/bigquery/docs/reference/v2/jobs) there is a property configuration.query.tableDefinitions.(key).csvOptions.quote that could help me, but I can't figure out how to make it work.
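The escaping described here matches the usual CSV quoting convention: a field containing double quotes is wrapped in double quotes and each embedded quote is doubled. The sketch below shows that transformation and how it could be reversed in a post-processing step. It assumes the export really does follow that convention and that each line holds exactly one field with no embedded newlines; `csvQuote` and `csvUnquote` are hypothetical helper names, not BigQuery functions.

```javascript
// Conventional CSV quoting: double every embedded quote, then wrap the field.
function csvQuote(field) {
  return '"' + field.replace(/"/g, '""') + '"';
}

// Reverse of csvQuote for a line that contains a single quoted field.
function csvUnquote(line) {
  if (line.length >= 2 && line[0] === '"' && line[line.length - 1] === '"') {
    return line.slice(1, -1).replace(/""/g, '"');
  }
  return line; // not quoted, return unchanged
}

var json = JSON.stringify({
  url: "some url",
  repeated_nested: [{ key: "some key", property: "some property" }]
});

var exported = csvQuote(json);        // what a one-column CSV line would look like
var recovered = csvUnquote(exported); // back to the original JSON text

console.log(exported);
console.log(recovered);
```

Unquoting line by line like this only works while no value contains a newline, since a quoted CSV field may span several lines.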

Does anybody have advice on how they have dealt with this sort of situation?

Andrew Backes asked Mar 11 '16 00:03


1 Answer

I have never had to do this, but you should be able to use flatten, then join, then use nest to get repeated fields again.
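To make the three steps concrete, here is a sketch of flatten, join, and nest in plain JavaScript over in-memory records. It only illustrates the shape of the data at each stage and is not BigQuery syntax. The function names are hypothetical, the second nested element is made up to show an unmatched key, and the grouping step relies on the asker's statement that a url always carries the same repeated_nested field. It also assumes keys in table B are unique.

```javascript
var tableA = [
  { url: "some url",
    repeated_nested: [
      { key: "some key", property: "some property" },
      { key: "unmatched key", property: "other property" }
    ] }
];
var tableB = [{ key: "some key", property2: "another property" }];

// Step 1: flatten, producing one row per (url, nested element).
function flatten(rows) {
  var flat = [];
  rows.forEach(function (row) {
    row.repeated_nested.forEach(function (item) {
      flat.push({ url: row.url, key: item.key, property: item.property });
    });
  });
  return flat;
}

// Step 2: left outer join on key; unmatched rows get property2 = null.
function leftJoin(flatRows, lookupRows) {
  var byKey = new Map();
  lookupRows.forEach(function (b) { byKey.set(b.key, b.property2); });
  return flatRows.map(function (r) {
    return {
      url: r.url, key: r.key, property: r.property,
      property2: byKey.has(r.key) ? byKey.get(r.key) : null
    };
  });
}

// Step 3: nest, grouping the joined rows back into one record per url.
function nest(joinedRows) {
  var byUrl = new Map();
  joinedRows.forEach(function (r) {
    if (!byUrl.has(r.url)) byUrl.set(r.url, []);
    byUrl.get(r.url).push({ key: r.key, property: r.property, property2: r.property2 });
  });
  return Array.from(byUrl, function (entry) {
    return { url: entry[0], repeated_nested: entry[1] };
  });
}

var nested = nest(leftJoin(flatten(tableA), tableB));
console.log(JSON.stringify(nested, null, 2));
```

Grouping on url alone is what lets the original row be reassembled without a per-row identifier; if two distinct rows could share a url, their nested elements would be merged.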

The docs state that BigQuery always flattens query results, but that appears to be false: you can choose to not have results flattened if you set a destination table. You should then be able to export that table as JSON to Storage.
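As a rough sketch of what that could look like, these are two job bodies for the v2 jobs API, written as JavaScript objects: one query job that writes unflattened results to a destination table, and one extract job that exports that table as newline-delimited JSON. The field names are the ones I understand the jobs resource to use. The project, dataset, table, and bucket names, and the `querySql` placeholder, are assumptions made up for illustration.

```javascript
// Placeholder for the actual flatten / join / nest query text.
var querySql = "/* legacy SQL query goes here */";

// Hypothetical destination table; replace with real identifiers.
var destination = {
  projectId: "my-project",
  datasetId: "my_dataset",
  tableId: "joined_nested"
};

// Query job: keep nested and repeated fields by disabling flattening.
// With legacy SQL, allowLargeResults has to be true when flattenResults is false.
var queryJob = {
  configuration: {
    query: {
      query: querySql,
      useLegacySql: true,
      allowLargeResults: true,
      flattenResults: false,
      destinationTable: destination
    }
  }
};

// Extract job: export the destination table to GCS as newline-delimited JSON.
var extractJob = {
  configuration: {
    extract: {
      sourceTable: destination,
      destinationUris: ["gs://my-bucket/joined_nested/part-*.json"],
      destinationFormat: "NEWLINE_DELIMITED_JSON"
    }
  }
};

console.log(JSON.stringify([queryJob, extractJob], null, 2));
```

Exporting straight to JSON this way would also sidestep the CSV quoting problem from the question, since no CSV step is involved.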

See also this answer for how to get nest to work.

oulenz answered Sep 21 '22 15:09
