
BigQuery Shard vs. BigQuery Partition

I have a table with 340 GB of data, but we only query the last week of data. To minimize cost, I am planning to move this data to either a partitioned table or sharded tables.

I ran an experiment with both approaches. I created a partitioned table and loaded two days' worth of data (two partitions), and I created two sharded tables (individual tables, one per day). I then queried the last two days' worth of data from each.

Full table: 27 sec
Partitioned table: 33 sec
Sharded tables: 91 sec

Please let me know which approach is best. In my experiment the query was fastest against the full table, but that query scans the entire table.

Thanks,

user374374 asked Sep 15 '16 15:09




1 Answer

According to the official GCP documentation on partitioning versus sharding, you should use partitioned tables:

Partitioned tables perform better than tables sharded by date. When you create date-named tables, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance. The recommended best practice is to use partitioned tables instead of date-sharded tables.
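To make the difference concrete, here is a minimal Python sketch that builds the standard SQL text for "last N days" against each layout. It assumes an ingestion-time partitioned table (filtered with the _PARTITIONTIME pseudo-column) and date-sharded tables named with a YYYYMMDD suffix (queried through a wildcard table and _TABLE_SUFFIX). The table names and helper functions are hypothetical, and the sketch only builds query strings; it does not call BigQuery.

```python
from datetime import date, timedelta
from typing import List


def shard_suffixes(end: date, days: int) -> List[str]:
    """Return YYYYMMDD suffixes for the `days` days ending at `end`, oldest first."""
    return [
        (end - timedelta(days=offset)).strftime("%Y%m%d")
        for offset in range(days - 1, -1, -1)
    ]


def partitioned_query(table: str, end: date, days: int) -> str:
    """Query one ingestion-time partitioned table, pruning by _PARTITIONTIME."""
    start = end - timedelta(days=days - 1)
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE _PARTITIONTIME BETWEEN TIMESTAMP('{start.isoformat()}') "
        f"AND TIMESTAMP('{end.isoformat()}')"
    )


def sharded_query(table_prefix: str, end: date, days: int) -> str:
    """Query date-sharded tables through a wildcard, filtering on _TABLE_SUFFIX."""
    suffixes = shard_suffixes(end, days)
    return (
        f"SELECT * FROM `{table_prefix}*` "
        f"WHERE _TABLE_SUFFIX BETWEEN '{suffixes[0]}' AND '{suffixes[-1]}'"
    )


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    end_day = date(2016, 9, 15)
    print(partitioned_query("my_project.my_dataset.events", end_day, 7))
    print(sharded_query("my_project.my_dataset.events_", end_day, 7))
```

In both cases the date filter should limit the scan to the last week of data. The partitioned form touches a single table, while the wildcard form resolves one table per day, which is where the per-table schema, metadata, and permission overhead described above comes from.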

vdolez answered Oct 11 '22 17:10