Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter spark/scala dataframe if column is present in set

I'm using Spark 1.4.0, this is what I have so far:

data.filter($"myColumn".in(lit("A"), lit("B"), lit("C"), ...))

The function lit converts a literal to a column.

Ideally I would put my A, B, C in a Set and check like this:

val validValues = Set("A", "B", "C", ...)
data.filter($"myColumn".in(validValues))

What's the correct syntax? Are there any alternative concise solutions?

like image 715
Marsellus Wallace Avatar asked Dec 03 '22 14:12

Marsellus Wallace


1 Answers

Spark 1.4 or older:

val validValues = Set("A", "B", "C").map(lit(_))
data.filter($"myColumn".in(validValues.toSeq: _*))

Spark 1.5 or newer:

val validValues = Set("A", "B", "C")
data.filter($"myColumn".isin(validValues.toSeq: _*))
like image 140
Paweł Jurczenko Avatar answered Dec 21 '22 08:12

Paweł Jurczenko