Storing JSON in BigQuery

I have various highly nested json objects. I am wondering whether to store these as STRUCTs in BigQuery or as a STRING. If storing it as a string, then I can use JSON_EXTRACT where necessary to get what I need. I have a few questions on using the following approach:

  • Would it be a bad idea storing json data as a string instead of record?
  • Would there be a big performance hit whenever using that json field if it's stored as a string?
  • What additional advantages would storing the json as a STRUCT instead of a string give?

Finally, I wasn't able to find any place in the documentation that gives examples of how to query STRUCTs. The only place I could find was https://cloud.google.com/bigquery/docs/nested-repeated. Are there examples in the documentation (or elsewhere) on querying nested fields? Additionally, why is the term RECORD and STRUCT used interchangeably on this page?

Note that the json will not be repeated at the root level, i.e., it will look like {...} and not [{...},{...}].

As a reference, in Redshift you would (as of this question) store json as a string and use the json-functions to manipulate it: https://stackoverflow.com/a/32731374/651174.
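For illustration, here is roughly how the two storage choices differ at query time. Table and field names below (`mydataset.events`, `payload`, `user.name`) are hypothetical, not from the question:

```sql
-- JSON stored as a STRING: extract fields with JSON functions
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.user.name') AS user_name
FROM `mydataset.events`;

-- Same data modeled as a STRUCT (RECORD) column: plain dot notation
SELECT
  payload.user.name AS user_name
FROM `mydataset.events_structured`;
```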

Asked Jun 04 '19 by David542


1 Answer

I usually do both:

  • Store JSON objects as STRINGs, for posterity and future refactoring.
  • Materialize easy-to-query tables from your JSON objects, to give you and your team a better querying experience.

My 3 steps:

  1. Store everything as JSON strings. Then you won't lose data in case of schema changes, for example.
  2. Create a VIEW that uses JSON_EXTRACT to expose the data as easy-to-query columns.
  3. Materialize those views into tables for the best performance and ease.
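The three steps above could be sketched roughly like this in BigQuery Standard SQL (table, view, and JSON path names are hypothetical, chosen only for illustration):

```sql
-- 1. Raw table: store everything as a JSON string
CREATE TABLE `mydataset.raw_events` (
  ingested_at TIMESTAMP,
  payload STRING  -- the full JSON object, untouched
);

-- 2. View that extracts the fields you care about into columns
CREATE VIEW `mydataset.events_view` AS
SELECT
  ingested_at,
  JSON_EXTRACT_SCALAR(payload, '$.user.id') AS user_id,
  JSON_EXTRACT_SCALAR(payload, '$.event.type') AS event_type
FROM `mydataset.raw_events`;

-- 3. Materialize the view into a table for query performance
CREATE TABLE `mydataset.events` AS
SELECT * FROM `mydataset.events_view`;
```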

Then, in case of schema change:

  1. Everything you have stored, stays the same.
  2. You can modify the views to suit the new schema.
  3. You can re-materialize tables into the new schema.
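Continuing the sketch above, handling a schema change then only touches the view and the materialized table, never the stored JSON (names remain hypothetical):

```sql
-- Adapt the view to the new schema; the raw JSON strings stay as-is
CREATE OR REPLACE VIEW `mydataset.events_view` AS
SELECT
  ingested_at,
  JSON_EXTRACT_SCALAR(payload, '$.user.id') AS user_id,
  JSON_EXTRACT_SCALAR(payload, '$.event.category') AS event_category  -- newly added field
FROM `mydataset.raw_events`;

-- Re-materialize into the new schema
CREATE OR REPLACE TABLE `mydataset.events` AS
SELECT * FROM `mydataset.events_view`;
```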
Answered Oct 17 '22 by Felipe Hoffa