Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BigQuery filter using LEFT OUTER JOIN or Anti-join

I can think of 2 methods to filter in Google BigQuery. (Say Table2 is small.) Which method of filtering would be preferable, and why? Does it depend on the size of table1? Or is there another, better method?

SELECT
  a.ID
FROM
  table1 AS a
LEFT OUTER JOIN
  table2 AS b
ON
  a.ID = b.ID
WHERE 
  b.ID is NULL

or

SELECT
  ID
FROM
  table1
WHERE
  ID NOT IN (
    SELECT
      ID
    FROM
      table2)
like image 637
Oof Avatar asked Jan 21 '16 15:01

Oof


People also ask

What is the use of left anti join?

A left anti join : This join returns rows in the left table that have no matching rows in the right table. A right anti join : This join returns rows in the right table that have no matching rows in the left table.

How do you optimize joins in BigQuery?

Use INT64 data types in joins to reduce cost and improve comparison performance. Best practice: If your use case supports it, use INT64 data types in joins instead of STRING data types. BigQuery does not index primary keys like traditional databases, so the wider the join column is, the longer the comparison takes.

What is the default join in BigQuery?

In addition, Google BigQuery uses the default equals (=) operator to compare columns and does not support other operators. Select Inner Join to only include records in which the joined columns from both tables satisfy the join condition.

How do I use left anti join in SQL?

Example. Imagine we have two tables; Cats and Dogs , and we then run the following query: SELECT CatName AS PetName FROM Cats EXCEPT SELECT DogName AS PetName FROM Dogs; This query uses the EXCEPT operator, and it will appear as a LEFT ANTI SEMI JOIN in the execution plan.


2 Answers

Second is limited to one and only one field to be filtered by.

First can be used to filter by multiple fields, like below

SELECT
  a.ID
FROM
  table1 AS a
LEFT OUTER JOIN
  table2 AS b
ON
  a.ID = b.ID AND a.Field2 = b.Field2 AND a.Field3 = b.Field3
WHERE 
  b.ID is NULL
like image 116
Mikhail Berlyant Avatar answered Oct 18 '22 21:10

Mikhail Berlyant


I tested a two fields anti-join and the difference was noticeable.

SELECT fields, ..
FROM inputA A
LEFT JOIN inputB B
  ON A.key = B.key AND A.key2 = B.key2
WHERE B.key IS NULL AND B.key2 IS NULL

Performs better than its NOT IN equivalent:

SELECT fields, ..
FROM inputA A
WHERE key NOT IN (SELECT NUM_RUT FROM key)
  AND key2 NOT IN (SELECT key2 FROM inputB)

in this case 5s vs 7s in BigQuery after reading 4.6Gb.

like image 5
Vidal Gonzalez Avatar answered Oct 18 '22 20:10

Vidal Gonzalez