Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BigQuery: join on clusters fields

I'm trying to optimize our BigQuery model by using clustered tables.

I'm testing these scenarios:

  1. Clustered table join clustered table (join by clusters fields)
  2. Not Clustered table join Not clustered table (join by the same fields that scenario 1)

Without apply any where condition scenario 1 and 2 have equal cost(time and bytes processed). When I apply a condition by a clustered field 1 is 4x time faster and cheaper.

Clustered fields are only useful when you use a condition in the query? and not in a join? In this case, if I perform a join without any condition, the performance is the same with or without the clusters

How can I improve a join between two tables in BigQuery?

EDIT 2021-05-31

Add query execution plan of both jobs:

  1. Clustered

  2. Non-clustered

like image 301
Leandro Abad Avatar asked May 31 '21 01:05

Leandro Abad


People also ask

Can you do joins in BigQuery?

Google BigQuery does not support other join types, such as a full outer join or right outer join. In addition, Google BigQuery uses the default equals (=) operator to compare columns and does not support other operators.

Can we do clustering without partitioning in BigQuery?

You can create a clustered table by querying either a partitioned table or a non-partitioned table. You cannot change an existing table to a clustered table by using query results.

What is the difference between partitioning and clustering in BigQuery?

Partition pruning is done before the query runs, so you can get the query cost after partitioning pruning through a dry run. Cluster pruning is done when the query runs, so the cost is known only after the query finishes. You need partition-level management.


Video Answer


1 Answers

From docs, I would say that the cluster will be simply ignore as you are comparing using another column, during the Join.

Now, for optimizing a join, you could try to reduce data before the join. For instance, try filtering the tables, or pre-aggregating them to reduce as much data as you can. Finally, also take care of the order of the tables on the join. Order them from the biggest to the smallest one.

like image 102
Miller Andrés Trujillo Achury Avatar answered Nov 15 '22 00:11

Miller Andrés Trujillo Achury