
How to write a condition based on multiple values for a DataFrame in Spark

I'm working on a Spark application (in Scala) and I have a List containing multiple values. I'd like to use this list to write a where clause for my DataFrame and select only a subset of rows. For example, my list contains 'value1', 'value2', and 'value3', and I would like to write something like this:

mydf.where($"col1" === "value1" || $"col1" === "value2" || $"col1" === "value3")

How can I do that programmatically, since the list contains many values?

HHH asked Mar 11 '23 22:03

1 Answer

You can map the list of values to a list of "filters" (of type Column), and then reduce that list into a single filter by applying the || operator to each pair of filters:

val possibleValues = Seq("value1", "value2", "value3")
val result = mydf.where(possibleValues.map($"col1" === _).reduce(_ || _))
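To see why this works without spinning up Spark, here is a minimal standalone sketch of the same map-then-reduce pattern using plain Booleans in place of Spark's Column type. The names `possibleValues` and `matches` are illustrative, not part of any API:

```scala
// The same pattern as the Spark answer, but on plain Booleans:
// map each candidate value to a Boolean comparison, then OR them together.
val possibleValues = Seq("value1", "value2", "value3")

def matches(value: String): Boolean =
  possibleValues.map(value == _).reduce(_ || _)

println(matches("value2")) // true
println(matches("value4")) // false
```

In Spark itself, `possibleValues.map($"col1" === _)` produces a `Seq[Column]`, and `reduce(_ || _)` combines them into one Column expression. Note that Spark's Column API also provides `isin`, so `mydf.where($"col1".isin(possibleValues: _*))` expresses the same condition more directly.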
Tzach Zohar answered Mar 17 '23 19:03