I am having trouble while trying to understand the concept of semi-join and how it is different from conventional join. I have tried some article already but not satisfied with the explanation, could someone please help me to understand it?
Semijoins are U-SQL's way filter a rowset based on the inclusion of its rows in another rowset. Other SQL dialects express this with the SELECT * FROM A WHERE A. key IN (SELECT B. key FROM B) pattern. There are two variants: LEFT SEMIJOIN and RIGHT SEMIJOIN .
Semijoin is a technique for processing a join between two tables that are stored sites. The basic idea is to reduce the transfer cost by first sending only the projected join column(s) to the other site, where it is joined with the second relation.
An anti-join is essentially the opposite of a semi-join: While a semi-join returns one copy of each row in the first table for which at least one match is found, an anti-join returns one copy of each row in the first table for which no match is found.
A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. This is a filtering join.
Simple example. Let's select students with grades using left outer join:
SELECT DISTINCT s.id
FROM students s
LEFT JOIN grades g ON g.student_id = s.id
WHERE g.student_id IS NOT NULL
Now the same with left semi-join:
SELECT s.id
FROM students s
WHERE EXISTS (SELECT 1 FROM grades g
WHERE g.student_id = s.id)
The latter is generally more efficient (depending on concrete DBMS and query optimizer).
As far as I know SQL dialects that support SEMIJOIN/ANTISEMI
are U-SQL/Cloudera Impala.
SEMIJOIN:
Semijoins are U-SQL’s way filter a rowset based on the inclusion of its rows in another rowset. Other SQL dialects express this with the SELECT * FROM A WHERE A.key IN (SELECT B.key FROM B) pattern.
More info Semi Join and Anti Join Should Have Their Own Syntax in SQL:
“Semi” means that we don’t really join the right hand side, we only check if a join would yield results for any given tuple.
-- IN
SELECT *
FROM Employee
WHERE DeptName IN (
SELECT DeptName
FROM Dept
)
-- EXISTS
SELECT *
FROM Employee
WHERE EXISTS (
SELECT 1
FROM Dept
WHERE Employee.DeptName = Dept.DeptName
)
EDIT:
Another dialect that supports SEMI/ANTISEMI join is KQL:
kind=leftsemi (or kind=rightsemi)
Returns all the records from the left side that have matches from the right. The result table contains columns from the left side only.
let t1 = datatable(key:long, value:string)
[1, "a",
2, "b",
3, "c"];
let t2 = datatable(key:long)
[1,3];
t1 | join kind=leftsemi (t2) on key
demo
Output:
key value
1 a
3 c
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With