
What does a REPEATED field in Google BigQuery mean?


Please check my understanding of the REPEATED field mode in the following examples:

    {
        "title": "History of Alphabet",
        "author": [
            {
                "name": "Larry"
            }
        ]
    }

This JSON has schema:

    [
        {
            "name": "title",
            "type": "STRING"
        },
        {
            "name": "author",
            "type": "RECORD",
            "fields": [
                {
                    "name": "name",
                    "type": "STRING"
                }
            ]
        }
    ]

But the following JSON

    {
        "title": "History of Alphabet",
        "author": ["Larry", "Steve", "Eric"]
    }

has schema:

    [
        {
            "name": "title",
            "type": "STRING"
        },
        {
            "name": "author",
            "type": "STRING",
            "mode": "REPEATED"
        }
    ]

Is this correct?

NB: I went through the documentation but couldn't find any explanation of this.

hans-t asked Aug 15 '15 01:08




1 Answer

Close. In your first example, author is an array of objects, which corresponds to a repeated record in BQ. So the schema would be:

    [
        {
            "name": "title",
            "type": "STRING"
        },
        {
            "name": "author",
            "type": "RECORD",
            "mode": "REPEATED",   <--- NOTE!
            "fields": [
                {
                    "name": "name",
                    "type": "STRING"
                }
            ]
        }
    ]

Your second data/schema pair looks good (but note that the overall schema is an array, not an object, and it needs commas between elements).
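To make the mapping concrete, here is a minimal Python sketch of how a JSON row could be checked against a schema of this shape. It is not BigQuery's own validation logic: the helpers `conforms` and `row_conforms` are hypothetical, only the STRING and RECORD types are handled, and a field without a "mode" is simply treated as a single, optional value.

```python
import json

# Schema from the answer: "author" is a REPEATED RECORD.
SCHEMA_REPEATED_RECORD = json.loads("""
[
    {"name": "title", "type": "STRING"},
    {"name": "author", "type": "RECORD", "mode": "REPEATED",
     "fields": [{"name": "name", "type": "STRING"}]}
]
""")

# Schema from the question's first attempt: same, but without "mode".
SCHEMA_PLAIN_RECORD = json.loads("""
[
    {"name": "title", "type": "STRING"},
    {"name": "author", "type": "RECORD",
     "fields": [{"name": "name", "type": "STRING"}]}
]
""")

# Schema from the question's second example: a REPEATED STRING.
SCHEMA_REPEATED_STRING = json.loads("""
[
    {"name": "title", "type": "STRING"},
    {"name": "author", "type": "STRING", "mode": "REPEATED"}
]
""")


def conforms(value, field):
    """Hypothetical helper: does `value` fit one schema field?"""
    if field.get("mode") == "REPEATED":
        # REPEATED: the JSON value must be an array, and every element
        # must fit the same field definition with the mode stripped off.
        if not isinstance(value, list):
            return False
        single = {k: v for k, v in field.items() if k != "mode"}
        return all(conforms(item, single) for item in value)
    if field["type"] == "RECORD":
        # RECORD: the JSON value must be an object with nested fields.
        return isinstance(value, dict) and row_conforms(value, field["fields"])
    if field["type"] == "STRING":
        return isinstance(value, str)
    raise ValueError("type not covered by this sketch: " + field["type"])


def row_conforms(row, fields):
    """Hypothetical helper: does a JSON object fit a list of fields?"""
    known = {f["name"] for f in fields}
    if set(row) - known:
        return False  # the row has a key the schema does not declare
    # Missing keys are tolerated here (an assumption of this sketch).
    return all(f["name"] not in row or conforms(row[f["name"]], f)
               for f in fields)


row_objects = {"title": "History of Alphabet", "author": [{"name": "Larry"}]}
row_strings = {"title": "History of Alphabet",
               "author": ["Larry", "Steve", "Eric"]}

print(row_conforms(row_objects, SCHEMA_REPEATED_RECORD))
print(row_conforms(row_objects, SCHEMA_PLAIN_RECORD))
print(row_conforms(row_strings, SCHEMA_REPEATED_STRING))
```

Under these assumptions, the array-of-objects row fits only the schema whose "author" field carries "mode": "REPEATED", while the array-of-strings row fits the REPEATED STRING schema from the question.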

There is some discussion of nested and repeated fields here: https://cloud.google.com/bigquery/docs/data?hl=en#nested

There are also some sample JSON data objects here: https://cloud.google.com/bigquery/preparing-data-for-bigquery#dataformats

But I agree we don't do a good job of explaining how those objects map to BQ schemas. Sorry about that!

Jeremy Condit answered Dec 23 '22 19:12
