Is there a more elegant way of filtering based on values in a Set of String?
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import spark.implicits._ // needed for the 'action symbol-to-column syntax

def myFilter(actions: Set[String], myDF: DataFrame): DataFrame = {
  // Wrap the Set membership test in a UDF and apply it to the "action" column
  val containsAction = udf((action: String) => actions.contains(action))
  myDF.filter(containsAction('action))
}
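For context, a call site might look like this (the DataFrame and the action names here are hypothetical):

val filtered = myFilter(Set("action1", "action2", "action3"), myDF)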
In SQL you can do
select * from myTable where action in ('action1', 'action2', 'action3')
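The same IN predicate can also be built as an expression string and passed to DataFrame.filter; a rough sketch, assuming the action values contain no embedded quotes:

// Build "action in ('action1', 'action2', ...)" from the Set
val inClause = actions.map(a => s"'$a'").mkString(", ")
myDF.filter(s"action in ($inClause)")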
How about this:
myDF.filter("action in ('action1', 'action2', 'action3')")
OR
// the $"..." column syntax requires: import spark.implicits._
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq("action1", "action2").map(lit(_)): _*))
OR
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(lit("action1"), lit("action2")): _*))
Additional support will be added to make this cleaner in Spark 1.5.
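Since Spark 1.5 you can use Column.isin, which takes the values as varargs; a minimal sketch reusing the question's Set[String]:

import org.apache.spark.sql.DataFrame

def myFilter(actions: Set[String], myDF: DataFrame): DataFrame =
  myDF.filter(myDF("action").isin(actions.toSeq: _*))

The Set is converted to a Seq and splatted with : _*, so no UDF is needed and the predicate stays visible to the Catalyst optimizer.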