Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

When I query a partitioned table, is it possible to filter by partition column with a subquery and reduce cost at the same time?

I can see from public documentation that BigQuery partition table has this limitation that if the partition column has a subquery as a filter, it won't prune the queried partition and reduce "bytes processed"(cost). I'm wondering if there is a way to workaround.

For example, this query will scan 38.67 GB, is there a way to reduce it?

WITH sub_query_that_generates_filter AS (
  SELECT DATE "2016-10-01" as month UNION ALL
  SELECT "2017-10-01" UNION ALL
  SELECT "2018-10-01"
)
SELECT block_hash, fee FROM `bigquery-public-data.crypto_bitcoin.transactions`
WHERE block_timestamp_month in 
(SELECT month FROM sub_query_that_generates_filter)
like image 292
Yun Zhang Avatar asked Oct 03 '19 20:10

Yun Zhang


People also ask

What is a benefit of querying a table partitioned by column?

A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.

How would you query specific partitions in a BigQuery table?

If you want to query data based on a time zone other than UTC, choose one of the following options: Adjust for time zone differences in your SQL queries. Use partition decorators to load data into specific ingestion-time partitions, based on a different time zone than UTC.

What is partitioned query?

A partition function is a database object that defines how the rows of a table or index are mapped to a set of partitions based on the values of a certain column, called a partitioning column. Each value in the partitioning column is an input to the partitioning function, which returns a partition value.


1 Answers

With BigQuery scripting, there is a way to reduce the cost.

Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then in subsequent query, scripting variable is used as a filter to prune the partitions to be scanned.

CREATE TEMP TABLE sub_query_that_generates_filter AS (
  SELECT DATE "2017-10-01" as month UNION ALL
  SELECT "2018-10-01" UNION ALL
  SELECT "2016-10-01" 
);
BEGIN
  DECLARE month_filter ARRAY<DATE> 
    DEFAULT (SELECT ARRAY_AGG(month) FROM sub_query_that_generates_filter);

  SELECT block_hash, fee FROM `bigquery-public-data.crypto_bitcoin.transactions` 
    WHERE block_timestamp_month in UNNEST(month_filter);
END

It scans only 2GB of data instead of 38GB. Cheaper and faster!

enter image description here

like image 61
Yun Zhang Avatar answered Sep 28 '22 08:09

Yun Zhang