Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I use the TABLE_QUERY() function in BigQuery?

A couple of questions about the TABLE_QUERY function:

  • The examples show using table_id in the query string, are there other fields available?
  • It seems difficult to debug. I'm getting "error evaluating subsidiary query" when I try to use it.
  • How does TABLE_QUERY() work?
like image 877
Jordan Tigani Avatar asked Mar 29 '14 18:03

Jordan Tigani


People also ask

How do you call a function in BigQuery?

After creating a persistent UDF, you can call it as you would any other function, prepended with the name of the dataset in which it is defined as a prefix. To call a UDF in a project other than the project that you are using to run the query, project_name is required.

How do I join two tables in a large query?

Click the join and choose Delete. Create a new join by dragging the column name from the second table to the corresponding column name in the first table. Click Execute SQL. If a preview of your data appears in the Sample Preview pane, the join was successfully created.

How would you query specific partitions in a BigQuery table?

If you want to query data based on a time zone other than UTC, choose one of the following options: Adjust for time zone differences in your SQL queries. Use partition decorators to load data into specific ingestion-time partitions, based on a different time zone than UTC.

How do I select all columns except one in BigQuery?

A SELECT * EXCEPT statement specifies the names of one or more columns to exclude from the result. All matching column names are omitted from the output. Note: SELECT * EXCEPT does not exclude columns that do not have names.


2 Answers

The TABLE_QUERY() function allows you to write a SQL WHERE clause that is evaluated to find which tables to run the query over. For instance, you can run the following query to count the rows in all tables in the publicdata:samples dataset that are older than 7 days:

SELECT count(*) FROM TABLE_QUERY(publicdata:samples,     "MSEC_TO_TIMESTAMP(creation_time) < "     + "DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY')") 

Or you can run this to query over all tables that have ‘git’ in the name (which are the github_timeline and the github_nested sample tables) and find the most common urls:

SELECT url, COUNT(*) FROM TABLE_QUERY(publicdata:samples, "table_id CONTAINS 'git'") GROUP EACH BY url ORDER BY url DESC LIMIT 100 

Despite being very powerful, TABLE_QUERY() can be difficult to use. The WHERE clause must be specified as a string, which can be a little bit awkward. Moreover, it can be difficult to debug, since when there is a problem, you only get the error “Error evaluating subsidiary query”, which isn’t always helpful.

How it works:

TABLE_QUERY() essentially executes two queries. When you run TABLE_QUERY(<dataset>, <table_query>), BigQuery executes SELECT table_id FROM <dataset>.__TABLES_SUMMARY__ WHERE <table_query> to get the list of table IDs to run the query on, then it executes your actual query over those tables.

The __TABLES__ portion of that query may look unfamiliar. __TABLES_SUMMARY__ is a meta-table containing information about tables in a dataset. You can use this meta-table yourself. For example, the query SELECT * FROM publicdata:samples.__TABLES_SUMMARY__ will return metadata about the tables in the publicdata:samples dataset.

Available Fields:

The fields of the __TABLES_SUMMARY__ meta-table (that are all available in the TABLE_QUERY query) include:

  • table_id: name of the table.
  • creation_time: time, in milliseconds since 1/1/1970 UTC, that the table was created. This is the same as the creation_time field on the table.
  • type: whether it is a view (2) or regular table (1).

The following fields are not available in TABLE_QUERY() since they are members of __TABLES__ but not __TABLES_SUMMARY__. They're kept here for historical interest and to partially document the __TABLES__ metatable:

  • last_modified_time: time, in milliseconds since 1/1/1970 UTC, that the table was updated (either metadata or table contents). Note that if you use the tabledata.insertAll() to stream records to your table, this might be a few minutes out of date.
  • row_count: number of rows in the table.
  • size_bytes: total size in bytes of the table.

How to debug

In order to debug your TABLE_QUERY() queries, you can do the same thing that BigQuery does; that is, you can run the the metatable query yourself. For example:

SELECT * FROM publicdata:samples.__TABLES_SUMMARY__  WHERE MSEC_TO_TIMESTAMP(creation_time)  <     DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY') 

lets you not only debug your query but also see what tables would be returned when you run the TABLE_QUERY function. Once you have debugged the inner query, you can put it together with your full query over those tables.

like image 129
Jordan Tigani Avatar answered Oct 02 '22 07:10

Jordan Tigani


Alternative answer, for those moving forward to Standard SQL:

  • BigQuery Standard SQL doesn't support TABLE_QUERY, but it supports * expansion for table names.
  • When expanding table names *, you can use the meta-column _TABLE_SUFFIX to narrow the selection.
  • Table expansion with * only works when all tables have compatible schemas.

For example, to get the average worldwide NOAA GSOD temperature between 2010 and 2014:

#standardSQL
SELECT AVG(temp) avg_temp, _TABLE_SUFFIX y
FROM `bigquery-public-data.noaa.gsod_20*` #every year that starts with "20"
WHERE _TABLE_SUFFIX BETWEEN "10" AND "14" #only years between 2010 and 2014
GROUP BY y
ORDER BY y
like image 28
Felipe Hoffa Avatar answered Oct 02 '22 06:10

Felipe Hoffa