I have two DataFrames: a and b. This is what they look like:
a
-------
v1 string
v2 string
roughly hundreds of millions of rows

b
-------
v2 string
roughly tens of millions of rows
I would like to keep the rows from DataFrame a where v2 is not in b("v2").
I know I could use a left join and filter where the right side is null, or Spark SQL with a "not in" construction. I bet there is a better approach, though.
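For reference, the left-join variant I have in mind looks roughly like this (Scala; a and b are the DataFrames above):

    // keep rows of a whose v2 has no match in b, via a left outer join
    val result = a.join(b, a("v2") === b("v2"), "left_outer")
      .filter(b("v2").isNull)
      .select(a("v1"), a("v2"))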
Pretty simple: use except() to subtract one DataFrame from another, i.e. to find their difference.
Specifically, except is defined on Dataset and "Returns a new Dataset containing rows in this Dataset but not in another Dataset".
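Since a and b do not share the same schema and except compares whole rows, one way to apply it here is on the key column alone and then join back to recover v1. A minimal sketch, assuming a and b are as described in the question:

    // v2 values that occur in a but not in b
    val keysToKeep = a.select("v2").except(b.select("v2"))

    // join back to a on v2 to recover v1 for the surviving keys
    val result = a.join(keysToKeep, Seq("v2"))

The join back is only needed because except requires both sides to have the same columns; restricting it to the single key column also keeps the amount of data that has to be shuffled small.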