BigQuery for logging events of different types with different properties

Tags:

logging

google-bigquery

I want to log events from my client component and analyze them in Google's BigQuery. My problem is that the events are of several different types (with potential for more types to be added in the future), and each event type has its own set of properties, differing in both number and type.

For example:

{"event":"action",
"properties":{"ts":1384441115,
"distinct_id":"5EB54670",
"action_type":"pause",
"time":"5"}}

{"event":"action",
"properties":{"ts":1384441115,
"distinct_id":"5EB54670",
"action_type":"resume",
"time":"15"}}

{"event":"section",
"properties":{"ts":1384441115,
"distinct_id":"5EB54670",
"section_name":"end",
"dl_speed":"0.5 Mbit/s",
"time":"25"}}

My question is - how do I handle this diversity in a tabular DB? My reason for choosing BigQuery is its ability to handle big data calculation and analysis of my logged events, but for that to happen I need to figure out the best practice to log these events.

I thought about 2 options:
1. have a large table that has columns for every property of every event type - in this case every row will contain empty fields.
2. have a separate table for each event type - this raises two issues: future events will call for new tables, and even worse, I lose the ability to perform calculations over all events (seeing as all events share some properties like ts, distinct_id and time).

I'm pretty sure I am not the first to have this use case, so I would love to hear about the best practices from you guys. Thanks!

Amit

715

asked Dec 11 '13 13:12

Amit

1 Answer

You have a number of options:

1. Use a wide schema. You can have a column for every property type, and you can add columns to the table by using the tables.update() method. While it may seem inefficient to have a lot of null columns, this is actually the most efficient way to store and query your data.

   Null values don't cost anything to store (e.g. if you have a table with a million rows and a column where only 10 rows have a value and the rest are null, you are only charged for storing the 10 values). Even better, null values don't cost anything to query either. A wide table schema also makes your queries less expensive, since each query reads only the columns you care about rather than all of the properties.
2. Store the properties in a repeated field as key-value pairs. In that case, you'll likely need a keyword that we haven't yet documented: OMIT ... IF. This is a pretty clean way of doing it; you'd end up with queries that look like
```
SELECT properties.value FROM my_table
OMIT properties IF properties.name <> "dl_speed"
```
   Of course, some queries could get pretty awkward in this scenario.
3. Store the properties in a JSON field, and extract the field names you need in the query. We've recently added a couple of functions that will make this easy and efficient; however, they haven't quite made it to production yet. I'll try to remember to update this answer when these go live, which will hopefully be today, but release schedules in December can be unpredictable.
4. I'd recommend against having a separate table to join against. While this is the common way to do things in a relational-database world, it is going to be less efficient in BigQuery. We usually recommend that you denormalize your data.
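
To make the first three layouts more concrete, here is a rough Python sketch of how one of the sample events from the question might be shaped into a row for each of them. It is only an illustration under assumptions: the helper names (`to_wide_row`, `to_key_value_row`, `to_json_row`), the `properties_json` column name, and the `name`/`value` sub-field names are hypothetical choices, and the code only builds plain dictionaries without calling any BigQuery API.

```python
import json

# Properties shared by every event type in the question's sample data.
COMMON = ("ts", "distinct_id", "time")

# Hypothetical full column list for the wide-schema option.
ALL_COLUMNS = ["ts", "distinct_id", "time",
               "action_type", "section_name", "dl_speed"]


def to_wide_row(event, all_columns):
    """Option 1: one column per property; missing properties become None (NULL)."""
    props = event["properties"]
    row = {"event": event["event"]}
    for col in all_columns:
        row[col] = props.get(col)
    return row


def to_key_value_row(event):
    """Option 2: shared fields as columns, the rest as a repeated name/value record."""
    props = event["properties"]
    row = {"event": event["event"]}
    row.update({k: props.get(k) for k in COMMON})
    row["properties"] = [
        {"name": k, "value": str(v)}
        for k, v in props.items() if k not in COMMON
    ]
    return row


def to_json_row(event):
    """Option 3: shared fields as columns, the rest serialized into one JSON string."""
    props = event["properties"]
    row = {"event": event["event"]}
    row.update({k: props.get(k) for k in COMMON})
    extra = {k: v for k, v in props.items() if k not in COMMON}
    row["properties_json"] = json.dumps(extra, sort_keys=True)
    return row


# The "section" event from the question.
event = {
    "event": "section",
    "properties": {
        "ts": 1384441115,
        "distinct_id": "5EB54670",
        "section_name": "end",
        "dl_speed": "0.5 Mbit/s",
        "time": "25",
    },
}

wide = to_wide_row(event, ALL_COLUMNS)
kv = to_key_value_row(event)
js = to_json_row(event)

for row in (wide, kv, js):
    print(json.dumps(row))
```

Each printed line is one JSON object, so rows shaped like this could be written out as newline-delimited JSON for loading. In the wide row, `action_type` comes out as `null`, which is the case the answer describes as costing nothing to store or query.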

147

answered Sep 29 '22 02:09

Jordan Tigani
