I have a table with 340 GB of data, but we only query the last week of data. To reduce cost, I am planning to move this data into either a partitioned table or sharded tables.
I ran an experiment with both approaches. I created a partitioned table and loaded two days' worth of data into it (two partitions), and I also created two sharded tables (one individual table per day). I then queried the last two days of data from each.
Full table: 27 sec
Partitioned table: 33 sec
Sharded tables: 91 sec
Please let me know which approach is best. In my experiment the query against the full table returned fastest, but that query scans the entire table.
Thanks,
Sharding and partitioning are both about breaking up a large data set into smaller subsets. The difference is that sharding implies the data is spread across multiple computers while partitioning does not. Partitioning is about grouping subsets of data within a single database instance.
Like clustering, partitioning uses user-defined partition columns to specify how data is partitioned and what data is stored in each partition. Unlike clustering, partitioning provides granular query cost estimates before you run a query.
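The cost estimate mentioned above can be obtained with a dry run. Below is a minimal sketch, assuming the `google-cloud-bigquery` Python client is installed; the helper names `estimate_scanned_bytes` and `format_gib` are made up for illustration, and the client and job config are passed in by the caller so that the helper itself has no external dependencies.

```python
def estimate_scanned_bytes(client, sql, job_config):
    """Return the number of bytes a query would scan, without running it.

    Assumptions: `client` is a google.cloud.bigquery.Client and `job_config`
    is a QueryJobConfig created with dry_run=True, so no data is read and
    nothing is billed.
    """
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


def format_gib(num_bytes):
    """Format a byte count as GiB with two decimals."""
    return f"{num_bytes / 1024**3:.2f} GiB"


# Hypothetical usage (needs credentials and a real table):
#
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
#   sql = "SELECT * FROM `my_project.my_dataset.events` WHERE event_date = '2024-01-01'"
#   print(format_gib(estimate_scanned_bytes(client, sql, config)))
```

Running the same dry run against the full table and against the partitioned table should show the difference in bytes scanned, which is what drives on-demand query cost rather than the elapsed seconds.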
A partitioned table is a table divided into segments, called partitions. Dividing a large table into smaller partitions improves performance and reduces cost by limiting the amount of data a query has to scan. Clustering sorts the data based on the values of one or more columns in the table.
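As a rough illustration of how this applies to the question, the sketch below builds the two SQL statements involved: one that copies the existing table into a date-partitioned table, and one that reads only the last week. All names are assumptions for the example: the table names are placeholders, and `event_date` is assumed to be a DATE column in the source table.

```python
def partitioned_copy_ddl(source, target, date_column):
    """Build a CREATE TABLE ... AS SELECT statement that copies `source`
    into a new table partitioned by a DATE column."""
    return (
        f"CREATE TABLE `{target}`\n"
        f"PARTITION BY {date_column}\n"
        f"AS SELECT * FROM `{source}`"
    )


def last_n_days_query(table, date_column, days=7):
    """Build a query that filters on the partitioning column, so only the
    partitions for the most recent days are scanned."""
    return (
        f"SELECT *\n"
        f"FROM `{table}`\n"
        f"WHERE {date_column} >= DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)"
    )


# Placeholder names, not taken from the original question.
print(partitioned_copy_ddl("my_dataset.events", "my_dataset.events_partitioned", "event_date"))
print(last_n_days_query("my_dataset.events_partitioned", "event_date"))
```

The important part is that the WHERE clause filters directly on the partitioning column; without such a filter the query still scans every partition.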
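Because of clustering, BigQuery takes less time to process the data, as rows with similar values in the clustering columns are stored together. Clustering improves efficiency, but the limitations often quoted for it are outdated: clustering is no longer restricted to partitioned tables, and the clustering columns can be changed on an existing table, although the new specification only applies to data written afterwards. A table can have at most four clustering columns.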
According to the official GCP documentation on partitioning versus sharding, you should use partitioned tables:
Partitioned tables perform better than tables sharded by date. When you create date-named tables, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance. The recommended best practice is to use partitioned tables instead of date-sharded tables.
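To make the comparison concrete, here is a small sketch of what querying date-sharded tables looks like. It assumes the shards follow the usual `prefix_YYYYMMDD` naming and are read through a wildcard table with a `_TABLE_SUFFIX` filter; the `events_` prefix and the helper names are hypothetical.

```python
from datetime import date, timedelta


def shard_names(prefix, end, days):
    """List the date-named tables covering `days` days ending at `end`.
    BigQuery keeps separate schema and metadata for each of these."""
    return [f"{prefix}{(end - timedelta(days=i)):%Y%m%d}" for i in range(days)]


def wildcard_query(prefix, start, end):
    """Build a wildcard-table query restricted to shards between two dates."""
    return (
        f"SELECT *\n"
        f"FROM `{prefix}*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'"
    )


end = date(2024, 1, 7)
print(shard_names("my_dataset.events_", end, 7))
print(wildcard_query("my_dataset.events_", end - timedelta(days=6), end))
```

A week of data means seven separate tables to resolve, each with its own metadata and permission checks, whereas the partitioned layout keeps one table and one schema.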