I'm trying to optimize our BigQuery model by using clustered tables.
I'm testing these scenarios:
Without applying any WHERE condition, scenarios 1 and 2 have equal cost (time and bytes processed). When I filter on a clustered field, scenario 1 is about 4x faster and cheaper.
Are clustered fields only useful when the query has a filter condition, and not in a join? In this case, if I perform a join without any filter, performance is the same with or without clustering.
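For reference, the two scenarios look roughly like this (table and column names are placeholders, not my real schema):

```sql
-- Scenario 1: clustered table; filtering on the clustering column prunes blocks
SELECT COUNT(*)
FROM mydataset.orders_clustered      -- CLUSTER BY customer_id
WHERE customer_id = 'C-123';

-- Scenario 2: same data, non-clustered; the same filter scans the full table
SELECT COUNT(*)
FROM mydataset.orders_plain
WHERE customer_id = 'C-123';
```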
How can I improve a join between two tables in BigQuery?
EDIT 2021-05-31
Added the query execution plans of both jobs:
Clustered
Non-clustered
Note that this limitation applied to Google BigQuery's legacy SQL, which did not support some join types, such as a full outer join or right outer join, and only allowed the equality (=) operator to compare columns in a join condition. Standard SQL does not have these restrictions.
You can create a clustered table by querying either a partitioned table or a non-partitioned table. You cannot change an existing table to a clustered table by using query results.
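For example, a clustered table can be created from query results like this (the table and column names are illustrative):

```sql
-- Hypothetical: build a partitioned and clustered copy of an existing table
CREATE TABLE mydataset.orders_clustered
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id
AS
SELECT *
FROM mydataset.orders_plain;
```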
Partition pruning happens before the query runs, so you can estimate the cost after partition pruning with a dry run. Cluster pruning happens while the query runs, so the cost is known only after the query finishes. Partitioning also requires partition-level management, which clustering does not.
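As a sketch, a dry run (here via the bq CLI; the project, dataset, and query are placeholders, and the command requires authenticated access) reports the bytes that would be processed after partition pruning, without actually running the query:

```shell
# Hypothetical dry run: validates the query and estimates bytes scanned
bq query --dry_run --use_legacy_sql=false '
SELECT COUNT(*)
FROM mydataset.orders_clustered
WHERE DATE(order_ts) = "2021-05-01"'
```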
From the docs, I would say that clustering is simply ignored here, because the join compares on a different column.
Now, to optimize a join, try to reduce the data before joining. For instance, filter the tables, or pre-aggregate them, to shrink the inputs as much as you can. Finally, also mind the order of the tables in the join: order them from the biggest to the smallest.
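As a rough sketch of that advice (table names, columns, and filters are made up):

```sql
-- Reduce each side before the join: filter early and pre-aggregate,
-- then join the big table against the smaller, already-reduced input.
SELECT o.customer_id, o.total_orders, c.segment
FROM (
  SELECT customer_id, COUNT(*) AS total_orders
  FROM mydataset.orders_plain            -- biggest table first
  WHERE DATE(order_ts) >= '2021-01-01'   -- filter before joining
  GROUP BY customer_id                   -- pre-aggregate to shrink the input
) AS o
JOIN mydataset.customers AS c            -- smaller table second
  ON o.customer_id = c.customer_id;
```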