Dealing with evolving schemas

Tags:

google-bigquery

We are a gaming company that stores events (up to one billion per day) in BigQuery. Events are sharded by month and application in order to lower query costs.

Now to our problem.

Our current solution supports adding new types of events, which leads to new versions of the table schema. The version number is also included in the table names.

I.e. events_app1_v2_201308 and events_app1_v2_201308

If we add events with new column types in September, we will also get events_app1_v3_201309.

We have written code that finds the tables involved (for a date range) and makes a union of them à la BigQuery's comma-separated FROM clause.
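To make the setup concrete, here is a minimal sketch of what such table-discovery code might look like. It is not our actual code: the function names and the `versions` lookup (mapping a YYYYMM shard to its schema version) are assumptions for illustration, and dataset prefixes are left out.

```python
from datetime import date


def month_range(start, end):
    """Yield (year, month) pairs from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def from_clause(app, versions, start, end):
    """Build a comma-separated table list for one app over a date range.

    `versions` is a hypothetical dict mapping "YYYYMM" to the schema
    version used by that month's shard.
    """
    tables = []
    for year, month in month_range(start, end):
        shard = f"{year:04d}{month:02d}"
        tables.append(f"events_{app}_v{versions[shard]}_{shard}")
    return ", ".join(tables)


# Hypothetical version lookup matching the table names in this question.
versions = {"201308": 2, "201309": 3}
print(from_clause("app1", versions, date(2013, 8, 1), date(2013, 9, 30)))
```

The resulting string mixes v2 and v3 tables in one FROM clause, which is exactly the case in question.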

But I just realised that this will NOT work when we make unions over different versions of the event tables.

Does anyone have a smart solution for dealing with this?

Right now we are investigating if JSON structures could help us. The current solution is just flat columns. [timestamp, eventId, value, value, value, ...]

From https://developers.google.com/bigquery/query-reference#from

Note: Unlike many other SQL-based systems, BigQuery uses the comma syntax to indicate table unions, not joins. This means you can run a query over several tables with compatible !? schemas as follows:


asked Sep 04 '13 08:09

Gunnar Eketrapp

1 Answer

You should be able to modify the schema of the old tables to add the new columns; the union should then match. Note that you can only add columns, not remove them. You can use the tables.patch() method to do this, or bq update --schema.

Moreover, as long as the new fields aren't marked REQUIRED, they should be considered compatible. If this is not the case, however, it would be a bug -- let us know if that is what you're experiencing.
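As a rough sketch of the idea, the widened schema for an old table could be computed like this before handing it to tables.patch() or bq update. The helper name `widen_schema` and the example field `level` are hypothetical; the schema is represented as the usual list of field dicts with `name`, `type`, and optional `mode`, and the sketch assumes new columns only ever need to be appended.

```python
import json


def widen_schema(old_fields, new_fields):
    """Return old_fields plus any fields that exist only in new_fields.

    Added fields are never left REQUIRED, since old rows have no value
    for them. Existing fields are kept unchanged and nothing is removed.
    """
    known = {field["name"] for field in old_fields}
    widened = [dict(field) for field in old_fields]
    for field in new_fields:
        if field["name"] in known:
            continue
        added = dict(field)
        if added.get("mode", "NULLABLE") == "REQUIRED":
            added["mode"] = "NULLABLE"
        widened.append(added)
        known.add(added["name"])
    return widened


# Hypothetical v2 and v3 schemas; "level" is an invented example column.
v2 = [
    {"name": "timestamp", "type": "TIMESTAMP"},
    {"name": "eventId", "type": "INTEGER"},
]
v3 = v2 + [{"name": "level", "type": "STRING", "mode": "REQUIRED"}]

# JSON schema that could be applied to the old v2 tables.
print(json.dumps(widen_schema(v2, v3), indent=2))
```

Applying the widened schema to every older shard should leave all tables in the date range with the same set of columns, so the comma-separated union lines up.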


answered Nov 11 '22 01:11

Jordan Tigani
