A couple of questions about the TABLE_QUERY function:
Besides table_id, are there other fields available in the query string? And how does TABLE_QUERY() work?
The TABLE_QUERY() function allows you to write a SQL WHERE clause that is evaluated to find which tables to run the query over. For instance, you can run the following query to count the rows in all tables in the publicdata:samples dataset that are older than 7 days:
SELECT count(*)
FROM TABLE_QUERY(publicdata:samples,
  "MSEC_TO_TIMESTAMP(creation_time) < DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY')")
Or you can run this to query over all tables that have 'git' in the name (which are the github_timeline and github_nested sample tables) and find the most common URLs:
SELECT url, COUNT(*) cnt
FROM TABLE_QUERY(publicdata:samples, "table_id CONTAINS 'git'")
GROUP EACH BY url
ORDER BY cnt DESC
LIMIT 100
Despite being very powerful, TABLE_QUERY() can be difficult to use. The WHERE clause must be specified as a string, which can be a little awkward. Moreover, it can be difficult to debug: when there is a problem, you only get the error "Error evaluating subsidiary query", which isn't always helpful.
How it works:
TABLE_QUERY() essentially executes two queries. When you run TABLE_QUERY(<dataset>, <table_query>), BigQuery executes SELECT table_id FROM <dataset>.__TABLES_SUMMARY__ WHERE <table_query> to get the list of table IDs to run the query on, then it executes your actual query over those tables.
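To make that concrete, here is roughly what happens for the 'git' example above (a sketch of the two internal steps; you never see the intermediate query yourself). First, BigQuery finds the matching tables:
SELECT table_id
FROM publicdata:samples.__TABLES_SUMMARY__
WHERE table_id CONTAINS 'git'
Then it runs your actual query over the tables that came back, much as if you had listed them by hand (in legacy SQL, a comma between tables means UNION ALL):
SELECT url, COUNT(*)
FROM publicdata:samples.github_timeline, publicdata:samples.github_nested
GROUP EACH BY url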
The __TABLES_SUMMARY__ portion of that query may look unfamiliar. __TABLES_SUMMARY__ is a meta-table containing information about the tables in a dataset, and you can use it yourself. For example, the query SELECT * FROM publicdata:samples.__TABLES_SUMMARY__ will return metadata about the tables in the publicdata:samples dataset.
Available Fields:
The fields of the __TABLES_SUMMARY__ meta-table (all of which are available in the TABLE_QUERY query) are:
- table_id: name of the table.
- creation_time: time, in milliseconds since 1/1/1970 UTC, that the table was created. This is the same as the creation_time field on the table.
- type: whether it is a view (2) or a regular table (1).
The following fields are not available in TABLE_QUERY(), since they are members of __TABLES__ but not __TABLES_SUMMARY__. They're kept here for historical interest and to partially document the __TABLES__ metatable:
- last_modified_time: time, in milliseconds since 1/1/1970 UTC, that the table was last updated (either metadata or table contents). Note that if you use tabledata.insertAll() to stream records to your table, this might be a few minutes out of date.
- row_count: number of rows in the table.
- size_bytes: total size in bytes of the table.
How to debug:
To debug your TABLE_QUERY() queries, you can do the same thing that BigQuery does: run the metatable query yourself. For example:
SELECT *
FROM publicdata:samples.__TABLES_SUMMARY__
WHERE MSEC_TO_TIMESTAMP(creation_time) < DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY')
lets you not only debug your WHERE clause but also see which tables would be matched when you run the TABLE_QUERY function. Once you have debugged the inner query, you can put it together with your full query over those tables.
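Relatedly, while row_count and size_bytes are not visible to TABLE_QUERY(), the historical notes above suggest you can still inspect them when deciding which tables to target by querying the __TABLES__ metatable directly (a sketch; whether __TABLES__ is directly queryable may depend on your environment):
SELECT table_id, row_count, size_bytes
FROM publicdata:samples.__TABLES__
ORDER BY size_bytes DESC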
Alternative answer, for those moving forward to Standard SQL:
For example, to get the average worldwide NOAA GSOD temperature between 2010 and 2014:
#standardSQL
SELECT AVG(temp) avg_temp, _TABLE_SUFFIX y
FROM `bigquery-public-data.noaa.gsod_20*` #every year that starts with "20"
WHERE _TABLE_SUFFIX BETWEEN "10" AND "14" #only years between 2010 and 2014
GROUP BY y
ORDER BY y
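Standard SQL also has a metadata view of its own, INFORMATION_SCHEMA.TABLES, which plays roughly the role __TABLES_SUMMARY__ plays in legacy SQL. A sketch, listing the GSOD tables matched by the wildcard above (the dataset path mirrors the earlier example and is an assumption; adjust it to your project):
#standardSQL
SELECT table_name, table_type
FROM `bigquery-public-data.noaa`.INFORMATION_SCHEMA.TABLES
WHERE table_name LIKE 'gsod_20%'
ORDER BY table_name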