How do I filter rows based on whether a column value is in a Set of Strings in a Spark DataFrame

Is there a more elegant way of filtering based on values in a Set[String]?

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

def myFilter(actions: Set[String], myDF: DataFrame): DataFrame = {
  // UDF that checks whether a row's action value is in the given set
  val containsAction = udf((action: String) => actions.contains(action))
  // 'action is the Symbol syntax for a column; needs sqlContext.implicits._ in scope
  myDF.filter(containsAction('action))
}
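
For context, the helper is invoked like this (the set values here are illustrative):

val filtered = myFilter(Set("action1", "action2", "action3"), myDF)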

In SQL you can do

select * from myTable where action in ('action1', 'action2', 'action3')
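
That SQL runs in Spark too once the DataFrame is registered as a table; a minimal sketch, assuming a SQLContext named sqlContext and an illustrative table name:

myDF.registerTempTable("myTable")
val filtered = sqlContext.sql(
  "select * from myTable where action in ('action1', 'action2', 'action3')")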
asked Jul 14 '15 by zzztimbo



1 Answer

How about this:

myDF.filter("action in (1,2)")

OR

import org.apache.spark.sql.functions.lit
// $"action" needs import sqlContext.implicits._ in scope
myDF.where($"action".in(Seq("action1", "action2").map(lit(_)): _*))

OR

import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(lit("action1"), lit("action2")): _*))

Additional support is being added in 1.5 to make this cleaner: Column.isin will accept the raw values directly, so the lit wrapping goes away.
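
On 1.5 or later that looks like the following sketch; the values mirror the question, and the last line assumes the asker's actions: Set[String] is in scope:

myDF.where($"action".isin("action1", "action2", "action3"))

// or splat the whole set in one go
myDF.where($"action".isin(actions.toSeq: _*))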

answered Sep 20 '22 by Justin Pihony