BigQuery best practice for segmenting tables by dates

I am new to columnar DB concepts and BigQuery in particular. I noticed that for the sake of performance and cost efficiency it is recommended to split data across tables not only logically - but also by time.

For example - while I need a table to store my logs (1 logical table that is called "logs"), it is actually considered a good practice to have a separate table for different periods, like "logs_2012", "logs_2013", etc... or even "logs_2013_01", "logs_2013_02", etc...

My questions:

1) Is it actually the best practice?

2) Where is it best to draw the line - an annual table? A monthly table? A daily table? You get the point...

3) In terms of retrieving the data via queries - what is the best approach? Should I construct my queries dynamically using the UNION option? If I had all my logs in one table - I would naturally use the WHERE clause to get data for the desired time range, but having data distributed over multiple tables makes it awkward. I come from the world of relational databases (if it wasn't obvious so far) and I'm trying to make the leap as smoothly as possible...

4) Using the distributed method (different tables for different periods) still raises the following question: before querying the data itself - I want to be able to determine, for a specific log type, what range is available for querying. For example - for a specific machine I would like to first present to my users the scope of their available logs, and let them choose the specific period within that scope to get insights for. How do I construct such a query when my data is distributed over a number of tables (one per period) and I don't know in advance which tables exist? I might try to access the table "logs_2012_12" when this table doesn't actually exist, or even worse - I wouldn't know which tables are relevant and available for my query.

Hope my questions make sense...

Amit

Asked Dec 12 '13 by Amit

1 Answer

Table naming

For daily tables, the suggested pattern is your table's base name plus the date in YYYYMMDD form, like '20131225'. For example, "logs20131225" or "logs_20131225".
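As a quick sketch of that convention (Python; the exact suffix format is taken from the YYYYMMDD pattern above):

```python
from datetime import date

def daily_table_name(base, day):
    """Build a daily table name like 'logs20131225' from a base name and a date."""
    return base + day.strftime("%Y%m%d")

print(daily_table_name("logs", date(2013, 12, 25)))  # logs20131225
```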

Ideal aggregation: Day, month, year?

The answer to this question will depend on your data and your queries.

  • Will you usually query one or two days of data? Then have daily tables, and your costs will be much lower, as you query only the data you need.
  • Will you usually query all your data? Then have all the data in one table. Querying many tables at once gets slower as the number of tables grows.
  • If in doubt, do more than one! You could keep daily, monthly, and yearly tables. For a small storage cost, you could save a lot on queries that target only the intended data.
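One way to feel out the trade-off is to count how many tables a given query window touches at daily granularity (a sketch in Python; the "logs" + YYYYMMDD naming is the convention from above):

```python
from datetime import date, timedelta

def daily_tables(base, start, end):
    """List the daily table names covering [start, end] inclusive."""
    names = []
    day = start
    while day <= end:
        names.append(base + day.strftime("%Y%m%d"))
        day += timedelta(days=1)
    return names

# A 3-day window touches 3 tables; a 3-year window would exceed 1,000.
print(daily_tables("logs", date(2013, 1, 1), date(2013, 1, 3)))
# ['logs20130101', 'logs20130102', 'logs20130103']
```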

Unions

Feel free to do unions.

  • Keep in mind that there is a limit of 1,000 tables per query. This means that with daily tables you won't be able to query 3 years of data in one go (3 × 365 = 1,095 > 1,000).
  • Remember that unions in BigQuery don't use the UNION keyword, but the "," that other databases use for joins. Joins in BigQuery can be done with the explicit SQL keyword JOIN (or JOIN EACH for very big joins).
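Since in legacy BigQuery SQL a union is just a comma-separated table list in the FROM clause, such a query can be assembled as a string client-side (a sketch; "mydataset" and the column list are placeholders, and the bracketed `[dataset.table]` form is legacy-SQL table syntax):

```python
def union_query(tables, where=""):
    """Build a legacy-SQL query that unions the given tables via the comma syntax."""
    if len(tables) > 1000:
        # BigQuery rejects queries referencing more than 1,000 tables.
        raise ValueError("at most 1000 tables per query")
    sql = "SELECT * FROM " + ", ".join("[mydataset." + t + "]" for t in tables)
    if where:
        sql += " WHERE " + where
    return sql

print(union_query(["logs20131224", "logs20131225"]))
# SELECT * FROM [mydataset.logs20131224], [mydataset.logs20131225]
```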

Table discovery

  • API: tables.list will list all tables in a dataset, through the API.
  • SQL: To query the list of tables from within SQL... stay tuned.
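In the meantime, one approach to question 4 is to fetch the table names via tables.list and derive the available date range client-side from the name suffixes (a sketch; the `names` list below stands in for the tables.list response, and the "logs" prefix is illustrative):

```python
import re

def available_range(table_names, prefix="logs"):
    """Given table names (e.g. from tables.list), return the (min, max)
    YYYYMMDD suffixes found for the given prefix, or None if none match."""
    pattern = re.compile(re.escape(prefix) + r"(\d{8})$")
    dates = sorted(m.group(1) for n in table_names if (m := pattern.match(n)))
    return (dates[0], dates[-1]) if dates else None

names = ["logs20131224", "logs20131225", "other_table"]
print(available_range(names))  # ('20131224', '20131225')
```

With the range in hand, you can present it to users and only ever reference tables you know exist, avoiding errors from querying a missing "logs20121201".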
Answered Dec 04 '22 by Felipe Hoffa