Using a subquery for _PARTITIONTIME in BigQuery does not limit cost

When I run the query below on BigQuery using standard SQL, the estimate says it will process 76.6 TB:

SELECT
  event_time, user_id, activity_id, dbm_insertion_order_id, dbm_total_media_cost_usd
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME > (SELECT * FROM `analytics-dwh.autobidding.activity_list`)
  AND TIMESTAMP_MICROS(event_time) > (SELECT TIMESTAMP_MICROS(MAX(event_time)) FROM `essence-analytics-dwh.ml_for_autobidding.nest_na_4q18_activity_updated_daily`)
  AND _TABLE_SUFFIX IN ('25', '20')

The table analytics-dwh.autobidding.activity_list has just one column with a unique list of integers

If I remove the subquery and use a literal timestamp instead, the query processes less than 500 GB:

SELECT
  event_time, user_id, activity_id, dbm_insertion_order_id, dbm_total_media_cost_usd
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME > TIMESTAMP('2018-12-20')
  AND TIMESTAMP_MICROS(event_time) > (SELECT TIMESTAMP_MICROS(MAX(event_time)) FROM `essence-analytics-dwh.ml_for_autobidding.nest_na_4q18_activity_updated_daily`)
  AND _TABLE_SUFFIX IN ('25', '20')
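
For reference, both estimates can be checked programmatically with a dry run (nothing is executed or billed). Below is a minimal sketch assuming the google-cloud-bigquery Python client; the event_time filter is omitted since it does not affect partition pruning.

from google.cloud import bigquery

client = bigquery.Client()
dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Same query body in both cases; only the _PARTITIONTIME filter differs.
template = """
SELECT event_time, user_id, activity_id, dbm_insertion_order_id, dbm_total_media_cost_usd
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME > {partition_filter}
  AND _TABLE_SUFFIX IN ('25', '20')
"""

filters = {
    "subquery filter": "(SELECT * FROM `analytics-dwh.autobidding.activity_list`)",
    "literal filter": "TIMESTAMP('2018-12-20')",
}

for label, partition_filter in filters.items():
    job = client.query(template.format(partition_filter=partition_filter),
                       job_config=dry_run)  # dry run: returns only an estimate
    print(f"{label}: ~{job.total_bytes_processed / 1e12:.2f} TB estimated")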

Why does this happen when I use a subquery? Is there a workaround?

2 Answers

Why does this happen when I use a subquery?

In general, partition pruning reduces query cost only when the filter can be evaluated at the outset of the query, without requiring any subquery evaluation or data scan.

You can see more in the documentation on limiting partitions queried using pseudo columns.

So, in your first query (the one with the subquery), pruning does not happen: BigQuery does not limit the partitions scanned based on a condition that involves a subquery.

In the second query you use _PARTITIONTIME > TIMESTAMP('2018-12-20'), so the partitions scanned are limited.

Bottom line: filters on _PARTITIONTIME that include subqueries can't be used to limit the number of partitions scanned for a partitioned table.

Is there a workaround?

You should split your task into two steps: first calculate the filter value for _PARTITIONTIME with whatever logic you need against the table analytics-dwh.autobidding.activity_list, then plug that value into the main query in place of the subquery, using any client of your choice.
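
A minimal sketch of that two-step pattern in Python, assuming the google-cloud-bigquery client and assuming a single MAX date or timestamp can be derived from activity_list (the column name activity_date below is hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Step 1: compute the cutoff in its own small query.
# (activity_date is a hypothetical column name - adapt the logic to your table.)
cutoff = list(client.query(
    "SELECT MAX(activity_date) AS cutoff "
    "FROM `analytics-dwh.autobidding.activity_list`"
).result())[0]["cutoff"]
cutoff_str = cutoff.strftime("%Y-%m-%d")  # assumes a DATE/TIMESTAMP value comes back

# Step 2: inline the computed value as a constant, so the partition filter looks
# exactly like the second query in the question and pruning can happen.
# Keep the rest of your original filters unchanged.
main_sql = f"""
SELECT event_time, user_id, activity_id, dbm_insertion_order_id, dbm_total_media_cost_usd
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME > TIMESTAMP('{cutoff_str}')
  AND _TABLE_SUFFIX IN ('25', '20')
"""
rows = client.query(main_sql).result()

Because the cutoff reaches BigQuery as a constant rather than a subquery, the main query is equivalent to the second one in the question, and the processed-bytes estimate should drop accordingly.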

Mikhail Berlyant

From looking at your query, I assume you want to select the MIN or MAX date from this:

SELECT * FROM `analytics-dwh.autobidding.activity_list`

And pass it to the second part of your query.

This can be done by using the BigQuery API inside code that passes the value between the two steps (for example, in Python or JavaScript).

Tamir Klein


