Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark's Column.isin function does not take List

I am trying to filter out rows from my Spark Dataframe.

val sequence = Seq(1,2,3,4,5)
df.filter(df("column").isin(sequence))

Unfortunately, I get an unsupported literal type error

java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(1,2,3,4,5)

according to the documentation it takes a scala.collection.Seq list

I guess I don't want a literal? Then what can I take in, some sort of wrapper class?

like image 880
Jake Fund Avatar asked Apr 12 '16 02:04

Jake Fund


2 Answers

@JustinPihony's answer is correct but it's incomplete. The isin function takes a repeated parameter for argument, so you'll need to pass it as so :

scala> val df = sc.parallelize(Seq(1,2,3,4,5,6,7,8,9)).toDF("column")
// df: org.apache.spark.sql.DataFrame = [column: int]

scala> val sequence = Seq(1,2,3,4,5)
// sequence: Seq[Int] = List(1, 2, 3, 4, 5)

scala> val result = df.filter(df("column").isin(sequence : _*))
// result: org.apache.spark.sql.DataFrame = [column: int]

scala> result.show
// +------+
// |column|
// +------+
// |     1|
// |     2|
// |     3|
// |     4|
// |     5|
// +------+
like image 126
eliasah Avatar answered Oct 08 '22 11:10

eliasah


This is happening because the underlying Scala implementation uses varargs, so the documentation in Java is not quite correct. It is using the @varargs annotation, so you can just pass in an array.

like image 26
Justin Pihony Avatar answered Oct 08 '22 13:10

Justin Pihony