
Spark Scala filter DataFrame where value not in another DataFrame

I have two DataFrames, a and b. This is what they look like:

a
-------
v1 string
v2 string

roughly hundreds of millions of rows


b
-------
v2 string

roughly tens of millions of rows

I would like to keep rows from DataFrame a where v2 is not in b("v2").

I know I could use a left join and filter where the right side is null, or Spark SQL with a "not in" construction. I bet there is a better approach, though.
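For reference, a minimal Scala sketch of that left-join-and-filter approach, assuming the schemas shown above (the rename avoids an ambiguous v2 reference after the join):

    import org.apache.spark.sql.functions.col

    // Keep rows of a whose v2 has no match in b: left outer join on the key,
    // then retain only the rows where the right side failed to match.
    val result = a
      .join(b.withColumnRenamed("v2", "b_v2"), col("v2") === col("b_v2"), "left_outer")
      .filter(col("b_v2").isNull)
      .drop("b_v2")

In Spark 2.0 and later, a left anti join expresses the same thing directly: a.join(b, Seq("v2"), "left_anti").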

asked Feb 14 '16 by devopslife

People also ask

How do you filter records from a DataFrame in Spark?

The Spark where() function filters rows of a DataFrame or Dataset based on one or more conditions or a SQL expression. where() can be used instead of filter() by users coming from a SQL background; the two operate identically.
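A minimal Scala sketch of that equivalence, assuming a hypothetical df with an age column:

    import org.apache.spark.sql.functions.col

    // filter() and where() are interchangeable; both accept a Column
    // condition or a SQL expression string.
    val adults1 = df.filter(col("age") >= 18)
    val adults2 = df.where(col("age") >= 18)
    val adults3 = df.where("age >= 18")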

How do I exclude a column from a DataFrame in Spark?

The Spark DataFrame provides the drop() method to remove a column or field from a DataFrame or Dataset. drop() can also remove multiple columns at once.
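A short sketch, assuming a hypothetical df with temp and debug columns:

    // Drop a single column by name; dropping a name that does not
    // exist is a no-op rather than an error.
    val trimmed = df.drop("temp")

    // Drop several columns at once (Spark 2.0+ varargs overload).
    val trimmedMore = df.drop("temp", "debug")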

How do you find the difference between two data frames in Spark?

Pretty simple: use except() to subtract one DataFrame from another, i.e. keep the rows that appear in the first but not in the second.
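A minimal sketch, assuming a SparkSession named spark is in scope:

    import spark.implicits._  // for toDF on local Seqs

    val df1 = Seq("a", "b", "c").toDF("v")
    val df2 = Seq("b", "c").toDF("v")

    // except() keeps rows of the left side that do not appear in the
    // right; both sides must have the same schema.
    val diff = df1.except(df2)  // contains only the row "a"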

How do you filter multiple conditions in PySpark DataFrame?

PySpark Filter with Multiple Conditions In PySpark, to filter() rows of a DataFrame on multiple conditions, you can use either a Column with a condition or a SQL expression. Below is a simple example using an AND (&) condition; you can extend this with OR (|) and NOT (~) expressions as needed.
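That snippet describes PySpark, but the same pattern in Scala (the language of this question) looks like the following sketch; df and the state and gender columns are hypothetical:

    import org.apache.spark.sql.functions.col

    // Combine conditions with && (AND), || (OR), and ! (NOT);
    // each comparison should be parenthesized when operators mix.
    val filtered = df.filter(col("state") === "OH" && col("gender") === "M")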


1 Answer

You can achieve that using the except method of Dataset, which "returns a new Dataset containing rows in this Dataset but not in another Dataset".
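A minimal sketch of how except could be applied to the schemas in the question. Because except requires both sides to share a schema and a carries the extra column v1, the key column is isolated first and then joined back (an untested sketch, not the answerer's exact code):

    // Distinct v2 values present in a but absent from b.
    val keysToKeep = a.select("v2").except(b.select("v2"))

    // Inner join back to a to recover the full rows (v1, v2).
    val result = a.join(keysToKeep, Seq("v2"))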

answered Oct 21 '22 by Javier Alba