Spark Dataset when to use Except vs Left Anti Join

I was wondering whether there is a performance difference between calling except (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#except(org.apache.spark.sql.Dataset)) and using a left anti-join. So far, the only difference I can see is that with the left anti-join, the two datasets can have different columns.

alexgbelov asked Sep 19 '18 19:09



1 Answer

Your title and your explanation ask different questions.

That said, if both datasets have the same structure, you can use either method to find missing data.

EXCEPT is a specific implementation that enforces the same structure on both sides and is a subtract operation, whereas LEFT ANTI JOIN allows different structures, as you say, but can give the same result.
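To make the "can give the same result" caveat concrete, here is a minimal pure-Python sketch of the two semantics. It does not call Spark, and the helper names `except_distinct` and `left_anti_join` are hypothetical. It assumes that Dataset.except behaves like SQL EXCEPT DISTINCT, as its API documentation states, and that a left anti join keeps every unmatched left row, duplicates included.

```python
# Pure-Python model of the two operations; rows are tuples. Not Spark code.

def except_distinct(left, right):
    """Model of EXCEPT DISTINCT: whole-row comparison, result de-duplicated."""
    right_rows = set(right)
    seen = set()
    result = []
    for row in left:
        if row not in right_rows and row not in seen:
            seen.add(row)
            result.append(row)
    return result


def left_anti_join(left, right, left_key, right_key):
    """Model of a left anti join: keep left rows whose key has no match on the right."""
    right_keys = {right_key(row) for row in right}
    return [row for row in left if left_key(row) not in right_keys]


left = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
right = [(1, "a")]

except_rows = except_distinct(left, right)
# Joining on the whole row mimics except, but duplicate left rows survive.
anti_rows = left_anti_join(left, right, lambda r: r, lambda r: r)

print(except_rows)  # [(2, 'b'), (3, 'c')]
print(anti_rows)    # [(2, 'b'), (2, 'b'), (3, 'c')]
```

Under these assumptions, the two agree only when the left side has no duplicate rows, or when the anti join result is de-duplicated afterwards.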

Use cases differ:

1) Left Anti Join can apply to many situations pertaining to missing data: customers with no orders (yet), orphans in a database.
2) Except is for subtracting things, e.g. splitting data into test and training sets in machine learning.
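The two use cases can be illustrated in plain Python rather than Spark. All data and variable names below are invented for illustration. In Spark, the first case would correspond to a join with join type `left_anti` and the second to except.

```python
# Pure-Python illustration of the two use cases; all data is invented.

customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Cy"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 40.0},
]

# Use case 1: left anti join. The two sides have different columns and are
# matched on customer_id only.
ordered_ids = {o["customer_id"] for o in orders}
customers_without_orders = [
    c for c in customers if c["customer_id"] not in ordered_ids
]

# Use case 2: subtraction. Both sides have the same structure, and the
# training set is whatever remains after removing the test rows.
all_rows = [(1, 0.5), (2, 0.7), (3, 0.1), (4, 0.9)]
test_rows = [(2, 0.7), (4, 0.9)]
test_set = set(test_rows)
train_rows = [row for row in all_rows if row not in test_set]

print([c["name"] for c in customers_without_orders])  # ['Bob', 'Cy']
print(train_rows)  # [(1, 0.5), (3, 0.1)]
```

The first case cannot be expressed as a subtraction at all, because the customer and order rows do not share a structure; the second works either way, since both sides have identical columns.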

Performance should not be a real deal breaker, as these are generally different use cases and therefore difficult to compare. Except will typically involve the same data source, whereas a left anti join (LAJ) will typically involve different data sources.

thebluephantom answered Nov 15 '22 09:11