We are using a public dataset to benchmark BigQuery. We took the same table and partitioned it by day, but it's not clear we are getting many benefits. What's a good balance? <pre class="prettyprint"><code>SELECT sum(score) FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions` WHERE creation_date > "2019-01-01" </code></pre> Takes 1 second, and processes 270.7MB. Same, with partitions: <pre class="prettyprint"><code>SELECT sum(score) FROM `temp.questions_partitioned` WHERE creation_date > "2019-01-01" </code></pre> Takes 2 seconds and processes 14.3 MB. So we see a benefit in MBs processed, but the query is slower. What's a good strategy to decide when to partition? (from an email I received today)

When partitioning a table, you need to consider having enough data for each partition. Think of each partition like being a different file - and opening 365 files might be slower than having a huge one. In this case, the table used for the benchmark has 1.6 GB of data for 2019 (until June in this one). That's 1.6GB/180 = 9 MB of data for each daily partition. For such a low amount of data - arranging it in daily partitions won't bring much benefits. Consider partitioning the data by year instead. See the following question to learn how: <ul> <li>Partition by week/month//quarter/year to get over the partition limit?</li> </ul> Another alternative is not partitioning the table at all, and instead using clustering to sort the data by date. Then BigQuery can choose the ideal size of each block. If you want to run your own benchmarks, do this: <pre class="prettyprint"><code>CREATE TABLE `temp.questions_partitioned` PARTITION BY DATE(creation_date) AS SELECT * FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions` </code></pre> vs no partitions, just clustering by date: <pre class="prettyprint"><code>CREATE TABLE `temp.questions_clustered` PARTITION BY fake_date CLUSTER BY creation_date AS SELECT *, DATE('2000-01-01') fake_date FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions` </code></pre> Then my query over the clustered table would be: <pre class="prettyprint"><code>SELECT sum(score) FROM `temp.questions_clustered` WHERE creation_date > "2019-01-01" </code></pre> And it took 0.5 seconds, 17 MB processed. Compared: <ul> <li>Raw table: 1 sec, 270.7MB</li> <li>Partitioned: 2 sec, 14.3 MB</li> <li>Clustered: 0.5 sec, 17 MB</li> </ul> We have a winner! Clustering organized the daily data (which isn't much for this table) into more efficient blocks than strictly partitioning it by day. It's also interesting to look at the execution details for each query on these tables: Slot time consumed <ul> <li>Raw table: 10.683 sec</li> <li>Partitioned: 7.308 sec</li> <li>Clustered: 0.718 sec</li> </ul> As you can see, the query over raw table used a lot of slots (parallelism) to get the results in 1 second. In this case 50 workers processed the whole table with multiple years of data, reading 17.7M rows. The query over the partitioned table had to use a lot of slots - but this because each slot was assigned smallish daily partitions, a reading that used 153 parallel workers over 0.9M rows. The clustered query instead was able to use a very low amount of slots. Data was well organized to be read by 57 parallel workers, reading 1.12M rows. <img src="https://i.stack.imgur.com/YPCmI.png" alt="enter image description here"> <img src="https://i.stack.imgur.com/oA0Eb.png" alt="enter image description here"> <img src="https://i.stack.imgur.com/DPxKe.png" alt="enter image description here"> See also: <ul> <li>https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b</li> <li>How can I improve the amount of data queried with a partitioned+clustered table?</li> <li>how clustering works in BigQuery</li> </ul>

What's a good balance to decide when to partition a table in BigQuery?

Tags:

database-partitioning

google-bigquery

database-cluster

We are using a public dataset to benchmark BigQuery. We took the same table and partitioned it by day, but it's not clear we are getting many benefits. What's a good balance?

SELECT  sum(score) 
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions` 
WHERE creation_date > "2019-01-01"

Takes 1 second, and processes 270.7MB.

Same, with partitions:

SELECT  sum(score) 
FROM `temp.questions_partitioned` 
WHERE creation_date > "2019-01-01"

Takes 2 seconds and processes 14.3 MB.

So we see a benefit in MBs processed, but the query is slower.

What's a good strategy to decide when to partition?

(from an email I received today)

748

asked Nov 07 '19 06:11

Felipe Hoffa

1 Answers

When partitioning a table, you need to consider having enough data for each partition. Think of each partition like being a different file - and opening 365 files might be slower than having a huge one.

In this case, the table used for the benchmark has 1.6 GB of data for 2019 (until June in this one). That's 1.6GB/180 = 9 MB of data for each daily partition.

For such a low amount of data - arranging it in daily partitions won't bring much benefits. Consider partitioning the data by year instead. See the following question to learn how:

Partition by week/month//quarter/year to get over the partition limit?

Another alternative is not partitioning the table at all, and instead using clustering to sort the data by date. Then BigQuery can choose the ideal size of each block.

If you want to run your own benchmarks, do this:

CREATE TABLE `temp.questions_partitioned`
PARTITION BY DATE(creation_date)
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`

vs no partitions, just clustering by date:

CREATE TABLE `temp.questions_clustered`
PARTITION BY fake_date
CLUSTER BY creation_date
AS

SELECT *, DATE('2000-01-01') fake_date  
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`

Then my query over the clustered table would be:

SELECT sum(score) 
FROM `temp.questions_clustered`
WHERE creation_date > "2019-01-01"

And it took 0.5 seconds, 17 MB processed.

Compared:

Raw table: 1 sec, 270.7MB
Partitioned: 2 sec, 14.3 MB
Clustered: 0.5 sec, 17 MB

We have a winner! Clustering organized the daily data (which isn't much for this table) into more efficient blocks than strictly partitioning it by day.

It's also interesting to look at the execution details for each query on these tables:

Slot time consumed

Raw table: 10.683 sec
Partitioned: 7.308 sec
Clustered: 0.718 sec

As you can see, the query over raw table used a lot of slots (parallelism) to get the results in 1 second. In this case 50 workers processed the whole table with multiple years of data, reading 17.7M rows. The query over the partitioned table had to use a lot of slots - but this because each slot was assigned smallish daily partitions, a reading that used 153 parallel workers over 0.9M rows. The clustered query instead was able to use a very low amount of slots. Data was well organized to be read by 57 parallel workers, reading 1.12M rows.

enter image description here

Felipe Hoffa

Related questions
                            
                                Is there a way to select table_id in a Bigquery Table Wildcard Query
                            
                                Stratified random sampling with BigQuery?
                            
                                Get the Last Modified date for all BigQuery tables in a BigQuery Project
                            
                                BigQuery - Concatenate multiple rows into a single row
                            
                                Determine what project id my App Engine code is running on
                            
                                BigQuery SQL running totals
                            
                                How to generate date series to occupy absent dates in google BiqQuery?
                            
                                BigQuery SQL for 28-day sliding window aggregate (without writing 28 lines of SQL)
                            
                                Knowledge Graph API in BigQuery
                            
                                BigQuery replaced most of my Spark jobs, am I missing something?
                            
                                Is there an Autoincrement in BigQuery?
                            
                                How to download all data in a Google BigQuery dataset?
                            
                                java.net.UnknownHostException Unable to resolve host "accounts.google.com": No address associated with hostname while inserting rows in bigquery
                            
                                Do I need create tables manually in Google BigQuery to view the raw event data from Firebase?
                            
                                MongoDB to BigQuery
                            
                                Indexes on BigQuery Table
                            
                                How to flatten an array with UNNEST or any other functions?
                            
                                google-bigquery format date as mm/dd/yyyy in query results
                            
                                BigQuery - IFERROR for standardSQL
                            
                                How to select multiple custom Firebase event parameters in BigQuery?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What's a good balance to decide when to partition a table in BigQuery?

Tags:

database-partitioning

google-bigquery

database-cluster

Felipe Hoffa

People also ask

1 Answers

Felipe Hoffa

Recent Activity

Donate For Us