We are a gaming company that stores events (Up to 1 giga events per day) to bigquery. Events are sharded over month and application in order to lower query costs.
Now to our problem.
Our current solution supports adding new type of events which leads to new versions of the table schema. This versions has also been added to the tables.
I.e. events_app1_v2_201308 and events_app1_v2_201308
If we add events with new column types in september we will also get events_app1_v3_201309
We have written code that finds out involved tables (for a date range) and makes a union of them a'la bigquery's comma separeted FROM clause.
But I just realised that this will NOT work when we make unions over different versions of the event tables.
Anyone that has a smart solution of how to deal with this!?
Right now we are investigating if JSON structures could help us. The current solution is just flat columns. [timestamp, eventId, value, value, value, ...]
From https://developers.google.com/bigquery/query-reference#from
Note: Unlike many other SQL-based systems, BigQuery uses the comma syntax to indicate table unions, not joins. This means you can run a query over several tables with compatible !? schemas as follows:
The industry solution to handling schema evolution is to include schema information with the data. So, when someone is writing data, they write schema and data both. And when someone wants to read that data, they first read schema and then read data based on the schema.
Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that is changing over time. Most commonly, it's used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.
Schema is of three types: Logical Schema, Physical Schema and view Schema. Logical Schema – It describes the database designed at logical level. Physical Schema – It describes the database designed at physical level. View Schema – It defines the design of the database at the view level.
To change the schema of a table by using SQL Server Management Studio, in Object Explorer, right-click on the table and then click Design. Press F4 to open the Properties window. In the Schema box, select a new schema. ALTER SCHEMA uses a schema level lock.
You should be able to modify the table schema of the old tables to add columns, then the union should match. Note that you can only add columns, not remove them. You can use the tables.patch() method to do this, or bq update --schema
Moreover, as long as the new fields aren't marked REQUIRED, they should be considered compatible. If this is not the case, however, it would be a bug -- let us know if that is what you're experiencing.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With