Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to choose the latest partition in BigQuery table?

I am trying to select data from the latest partition in a date-partitioned BigQuery table, but the query still reads data from the whole table.

I've tried (as far as I know, BigQuery does not support QUALIFY):

SELECT col FROM table WHERE _PARTITIONTIME = (
  SELECT pt FROM (
    SELECT pt, RANK() OVER(ORDER by pt DESC) as rnk FROM (
      SELECT _PARTITIONTIME AS pt FROM table GROUP BY 1)
    )
  )
  WHERE rnk = 1
);

But this does not work and reads all rows.

SELECT col from table WHERE _PARTITIONTIME = TIMESTAMP('YYYY-MM-DD')

where 'YYYY-MM-DD' is a specific date does work.

However, I need to run this script in the future, but the table update (and the _PARTITIONTIME) is irregular. Is there a way I can pull data only from the latest partition in BigQuery?

like image 440
cshin9 Avatar asked Sep 27 '16 20:09

cshin9


2 Answers

Sorry for digging up this old question, but it came up in a Google search and I think the accepted answer is misleading.

As far as I can tell from the documentation and running tests, the accepted answer will not prune partitions because a subquery is used to determine the most recent partition:

Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.

So, although the suggested answer will deliver the results you expect, it will still query all partitions. It will not ignore all older partitions and only query the latest.

The trick is to use a more-or-less-constant to compare to, instead of a subquery. For example, if _PARTITIONTIME isn't irregular but daily, try pruning partitions by getting yesterdays partition like so:

SELECT * FROM [dataset.partitioned_table]
    WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)

Sure, this isn't always the latest data, but in my case this happens to be close enough. Use INTERVAL 0 DAY if you want todays data, and don't care that the query will return 0 results for the part of the day where the partition hasn't been created yet.

I'm happy to learn if there is a better workaround to get the latest partition!

like image 78
Patrick Atoon Avatar answered Sep 19 '22 12:09

Patrick Atoon


October 2019 Update

Support for Scripting and Stored Procedures is now in beta (as of October 2019)

You can submit multiple statements separated with semi-colons and BigQuery is able to run them now

See example below

DECLARE max_date TIMESTAMP;
SET max_date = (
  SELECT MAX(_PARTITIONTIME) FROM project.dataset.partitioned_table`);

SELECT * FROM `project.dataset.partitioned_table`
WHERE _PARTITIONTIME = max_date;

Update for those who like downvoting without checking context, etc.

I think, this answer was accepted because it addressed the OP's main question Is there a way I can pull data only from the latest partition in BigQuery? and in comments it was mentioned that it is obvious that BQ engine still scans ALL rows but returns result based on ONLY recent partition. As it was already mentioned in comment for question - Still something that easily to be addressed by having that logic scripted - first getting result of subquery and then use it in final query

Try

SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
  SELECT MAX(TIMESTAMP(partition_id))
  FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
)  

or

SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
  SELECT MAX(_PARTITIONTIME) 
  FROM [dataset.partitioned_table]
)  
like image 27
Mikhail Berlyant Avatar answered Sep 19 '22 12:09

Mikhail Berlyant