Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

I have daily tables on BigQuery. How to query the "newest" one?

I create daily tables. How to query the "newest" one?

For example GitHubArchive now publishes daily tables - instead of a monolithic one (see /r/bigquery/.../github_archive_changes_monthly_and_daily_tables/).

like image 201
Felipe Hoffa Avatar asked Feb 10 '15 01:02

Felipe Hoffa


People also ask

How do I update a scheduled query in BigQuery?

To change the start time, select the Start at set time option, enter the desired start date and time, and then save. Note: If the specified start time is later than the time in the schedule, then the first run of the query will be in the next iteration of the cycle.

Can BigQuery tables update?

The BigQuery data manipulation language (DML) enables you to update, insert, and delete data from your BigQuery tables. You can execute DML statements just as you would a SELECT statement, with the following conditions: You must use Google Standard SQL.


2 Answers

To find the latest table, you could use BigQuery's table query functions:

For example, the latest daily GitHubArchive table:

SELECT COUNT(*) num_records
FROM TABLE_QUERY(githubarchive:day,
  "table_id IN (
     SELECT table_id FROM githubarchive:day.__TABLES__ 
     ORDER BY creation_time DESC LIMIT 1)")

# 201859

For maximum convenience, you can store that query as a view to share with others. Then to query the latest table, just query the view:

SELECT num_records
FROM [fh-bigquery:public_dump.githubarchive_latest_day]

# 201859

For more on table query functions, see Jordan's answer to How do I use the TABLE_QUERY() function in BigQuery?.

like image 159
Felipe Hoffa Avatar answered Oct 02 '22 05:10

Felipe Hoffa


Felipe's answer uses Legacy SQL. Here's a solution in Standard SQL.

If your table names have a predictable suffix like GitHub's example in your link (which uses a YYYYMMDD suffix for each day), you can use wildcard tables and TABLE_SUFFIX:

SELECT COUNT(*)
FROM `githubarchive.day.events_*`
WHERE _TABLE_SUFFIX = (
  SELECT MAX(SUBSTR(table_id, -8))
  FROM `githubarchive.day.__TABLES_SUMMARY__`
  WHERE REGEXP_CONTAINS(table_id, "events_\\d{8}")
)

If you want to rely on BigQuery's metadata about creation time instead of the prefix, you can do:

SELECT COUNT(*)
FROM `githubarchive.day.*`
WHERE _TABLE_SUFFIX = (
  SELECT table_id
  FROM `githubarchive.day.__TABLES_SUMMARY__`
  ORDER BY creation_time DESC
  LIMIT 1
)

Note that if you have tables with different schemas in your project, the latter syntax is dangerous, since the query will use the schema of the first table it sees in the wildcard.

If you're trying to create tables that are partitioned by date, there's a BigQuery feature called date-partitioned tables that does just that, and makes it easier to query afterwards.

like image 37
Ted Avatar answered Oct 02 '22 06:10

Ted