I can see from public documentation that BigQuery partition table has this limitation that if the partition column has a subquery as a filter, it won't prune the queried partition and reduce "bytes processed"(cost). I'm wondering if there is a way to workaround.
For example, this query will scan 38.67 GB, is there a way to reduce it?
WITH sub_query_that_generates_filter AS (
SELECT DATE "2016-10-01" as month UNION ALL
SELECT "2017-10-01" UNION ALL
SELECT "2018-10-01"
)
SELECT block_hash, fee FROM `bigquery-public-data.crypto_bitcoin.transactions`
WHERE block_timestamp_month in
(SELECT month FROM sub_query_that_generates_filter)
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
If you want to query data based on a time zone other than UTC, choose one of the following options: Adjust for time zone differences in your SQL queries. Use partition decorators to load data into specific ingestion-time partitions, based on a different time zone than UTC.
A partition function is a database object that defines how the rows of a table or index are mapped to a set of partitions based on the values of a certain column, called a partitioning column. Each value in the partitioning column is an input to the partitioning function, which returns a partition value.
With BigQuery scripting, there is a way to reduce the cost.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then in subsequent query, scripting variable is used as a filter to prune the partitions to be scanned.
CREATE TEMP TABLE sub_query_that_generates_filter AS (
SELECT DATE "2017-10-01" as month UNION ALL
SELECT "2018-10-01" UNION ALL
SELECT "2016-10-01"
);
BEGIN
DECLARE month_filter ARRAY<DATE>
DEFAULT (SELECT ARRAY_AGG(month) FROM sub_query_that_generates_filter);
SELECT block_hash, fee FROM `bigquery-public-data.crypto_bitcoin.transactions`
WHERE block_timestamp_month in UNNEST(month_filter);
END
It scans only 2GB of data instead of 38GB. Cheaper and faster!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With