I have two tables. One holds the core data: a pair of IDs (PC1 and P2) and some blob data (P3). The other is a blacklist for the IDs in the first table. I will call the first table in_df and the second blacklist_df.
What I want to do is remove rows from in_df whenever in_df.PC1 == blacklist_df.P1 and in_df.P2 == blacklist_df.B1. Here is a code snippet that shows what I want to achieve more explicitly.
in_df = sqlContext.createDataFrame(
    [[1, 2, 'A'], [2, 1, 'B'], [3, 1, 'C'], [4, 11, 'D'], [1, 3, 'D']],
    ['PC1', 'P2', 'P3'])
in_df.show()
+---+---+---+
|PC1| P2| P3|
+---+---+---+
|  1|  2|  A|
|  2|  1|  B|
|  3|  1|  C|
|  4| 11|  D|
|  1|  3|  D|
+---+---+---+

blacklist_df = sqlContext.createDataFrame([[1, 2], [2, 1]], ['P1', 'B1'])
blacklist_df.show()
+---+---+
| P1| B1|
+---+---+
|  1|  2|
|  2|  1|
+---+---+
In the end, what I want to get is the following:
+---+--+--+
|PC1|P2|P3|
+---+--+--+
|  1| 3| D|
|  3| 1| C|
|  4|11| D|
+---+--+--+
I tried a LEFT_ANTI join, but I haven't been successful.
There are two types of anti joins. A left anti join returns the rows in the left table that have no matching rows in the right table. A right anti join returns the rows in the right table that have no matching rows in the left table.
Applying the '~' (negation) operator to a semi-join condition results in an anti-join. A semi-join is similar to an inner join in that it returns the intersection, but it returns only the columns from the left table and not those from the right.
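To make the semi-join / anti-join relationship concrete, here is a minimal plain-Python sketch (no Spark involved) that applies both to the data from the question. The helper names left_semi_join, left_anti_join and pair_key are hypothetical and exist only for this illustration; they are not part of any library.

```python
# Plain-Python illustration of join semantics; this is not PySpark code.
# Rows mirror the question's data: (PC1, P2, P3) and (P1, B1).
in_rows = [(1, 2, 'A'), (2, 1, 'B'), (3, 1, 'C'), (4, 11, 'D'), (1, 3, 'D')]
blacklist_rows = [(1, 2), (2, 1)]


def left_semi_join(left, right_keys, key):
    """Keep left rows whose key appears on the right; only left columns survive."""
    keys = set(right_keys)
    return [row for row in left if key(row) in keys]


def left_anti_join(left, right_keys, key):
    """Keep left rows whose key does NOT appear on the right (negated semi-join)."""
    keys = set(right_keys)
    return [row for row in left if key(row) not in keys]


def pair_key(row):
    """Composite key: both ID columns must match, as in the question."""
    return (row[0], row[1])


kept = left_anti_join(in_rows, blacklist_rows, pair_key)
dropped = left_semi_join(in_rows, blacklist_rows, pair_key)

# A "right anti join" is the same operation with the two tables swapped.
unused_blacklist = left_anti_join(
    blacklist_rows, [pair_key(r) for r in in_rows], lambda r: r
)

print(kept)              # rows that survive the blacklist
print(dropped)           # rows that the blacklist removes
print(unused_blacklist)  # blacklist entries that matched nothing
```

Every row of in_rows lands in exactly one of kept or dropped, which is why negating the semi-join test gives the anti-join. The sketch keeps rows in their original order, whereas Spark makes no ordering guarantee for join output.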
Pass the join conditions as a list to the join function, and specify how='left_anti' as the join type:
in_df.join(
    blacklist_df,
    [in_df.PC1 == blacklist_df.P1, in_df.P2 == blacklist_df.B1],
    how='left_anti'
).show()
+---+---+---+
|PC1| P2| P3|
+---+---+---+
|  1|  3|  D|
|  4| 11|  D|
|  3|  1|  C|
+---+---+---+