Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter Spark DataFrame based on another DataFrame that specifies denylist criteria

I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (single column and 10,000 rows).

I'd like to filter all the rows from the largeDataFrame whenever the some_identifier column in the largeDataFrame matches one of the rows in the smallDataFrame.

Here's an example:

largeDataFrame

some_idenfitier,first_name 111,bob 123,phil 222,mary 456,sue 

smallDataFrame

some_identifier 123 456 

desiredOutput

111,bob 222,mary 

Here is my ugly solution.

val smallDataFrame2 = smallDataFrame.withColumn("is_bad", lit("bad_row")) val desiredOutput = largeDataFrame.join(broadcast(smallDataFrame2), Seq("some_identifier"), "left").filter($"is_bad".isNull).drop("is_bad") 

Is there a cleaner solution?

like image 560
Powers Avatar asked Oct 06 '16 04:10

Powers


People also ask

What is the difference between select and selectExpr in Spark?

Therefore, select() method is useful when you simply need to select a subset of columns from a particular Spark DataFrame. On the other hand, selectExpr() comes in handy when you need to select particular columns while at the same time you also need to apply some sort of transformation over particular column(s).

How do I use ISIN in Pyspark?

In Spark isin() function is used to check if the DataFrame column value exists in a list/array of values. To use IS NOT IN, use the NOT operator to negate the result of the isin() function.

What is Leftanti join in Spark?

Anti Join. An anti join returns values from the left relation that has no match with the right. It is also referred to as a left anti join.


1 Answers

You'll need to use a left_anti join in this case.

The left anti join is the opposite of a left semi join.

It filters out data from the right table in the left table according to a given key :

largeDataFrame    .join(smallDataFrame, Seq("some_identifier"),"left_anti")    .show // +---------------+----------+ // |some_identifier|first_name| // +---------------+----------+ // |            222|      mary| // |            111|       bob| // +---------------+----------+ 
like image 75
eliasah Avatar answered Sep 22 '22 04:09

eliasah