I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (single column and 10,000 rows).
I'd like to filter out all the rows from the largeDataFrame whose some_identifier column matches one of the rows in the smallDataFrame.
Here's an example:
largeDataFrame

some_identifier,first_name
111,bob
123,phil
222,mary
456,sue

smallDataFrame

some_identifier
123
456

desiredOutput

111,bob
222,mary
Here is my ugly solution.
val smallDataFrame2 = smallDataFrame.withColumn("is_bad", lit("bad_row"))

val desiredOutput = largeDataFrame
  .join(broadcast(smallDataFrame2), Seq("some_identifier"), "left")
  .filter($"is_bad".isNull)
  .drop("is_bad")
Is there a cleaner solution?
In Spark, the isin() function checks whether a DataFrame column value exists in a list/array of values. To express IS NOT IN, negate the result of isin() with the NOT operator.
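As a sketch of that alternative, assuming the smallDataFrame is small enough (10,000 rows) to collect its identifiers to the driver, the filter could look like this. The session setup and example data here are hypothetical, mirroring the question's example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("isin-filter").getOrCreate()
import spark.implicits._

// Example data from the question.
val largeDataFrame = Seq((111, "bob"), (123, "phil"), (222, "mary"), (456, "sue"))
  .toDF("some_identifier", "first_name")
val smallDataFrame = Seq(123, 456).toDF("some_identifier")

// Collect the identifiers to the driver, then keep only the rows whose
// identifier is NOT in that list.
val ids = smallDataFrame.as[Int].collect()
val result = largeDataFrame.filter(!$"some_identifier".isin(ids: _*))
result.show()
```

This only makes sense when the lookup side fits comfortably in driver memory; for larger lookup tables a join is the safer choice.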
Anti join: an anti join returns the rows from the left relation that have no match in the right. It is also referred to as a left anti join.
You'll need to use a left_anti join in this case.
The left anti join is the opposite of a left semi join: it removes from the left table the rows whose key appears in the right table:
largeDataFrame
  .join(smallDataFrame, Seq("some_identifier"), "left_anti")
  .show()

// +---------------+----------+
// |some_identifier|first_name|
// +---------------+----------+
// |            222|      mary|
// |            111|       bob|
// +---------------+----------+
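Since the smallDataFrame has only 10,000 rows, you can also keep the broadcast hint from the original solution and combine it with left_anti, so the anti join runs as a broadcast join rather than a shuffle. A minimal self-contained sketch (session setup and data are hypothetical, mirroring the question's example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.master("local[*]").appName("anti-join").getOrCreate()
import spark.implicits._

// Example data from the question.
val largeDataFrame = Seq((111, "bob"), (123, "phil"), (222, "mary"), (456, "sue"))
  .toDF("some_identifier", "first_name")
val smallDataFrame = Seq(123, 456).toDF("some_identifier")

// Broadcast the small side and anti-join: keeps rows of largeDataFrame
// whose some_identifier does NOT appear in smallDataFrame.
val desiredOutput = largeDataFrame
  .join(broadcast(smallDataFrame), Seq("some_identifier"), "left_anti")
desiredOutput.show()
```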