Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How To Apply Multiple Conditions on Case-Otherwise Statement Using Spark Dataframe API

I am trying to add a new column to an existing data frame using the withColumn statement in Spark Dataframe API. The below code works but I was wondering if there's a way that I can select more than one group. Let's say Group 1, 2, 3, 4 instead of only Group 1. I think I may be able to write when statement four times. I have seen people do that in some posts. However, in R, there is a %in% operator that can specify if a variable contains values in a vector, but I don't know if there's such thing in Spark. I checked on the Spark API documentation but most of the functions don't contain any examples.

# R Sample Code:
 library(dplyr)
 df1 <- df %>% mutate( Selected_Group = (Group %in% 1:4))

Spark Dataframe Sample Code That Selects Group 1:

 val df1 = df.withColumn("Selected_Group", when($"Group" === 1, 1).otherwise(0))

Data

ID, Group
1, 0
2, 1
3, 2
. .
. .
100, 99

like image 455
SH Y. Avatar asked Feb 24 '26 08:02

SH Y.


1 Answers

With UDF:

import org.apache.spark.sql.functions.udf

def in(s: Set[Int]) = udf((x: Int) => if (s.contains(x)) 1 else 0)
df.withColumn("Selected_Group", in((1 to 4).toSet)($"group"))

With raw SQL:

df.registerTempTable("df")
sqlContext.sql(
    "SELECT *, CAST(group IN (1, 2, 3, 4) AS INT) AS Selected_Group FROM df"
)

With Column.in method:

import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.sql.types.IntegerType

df.withColumn(
  "Selected_Group",
  $"group".in((1 to 4).map(lit): _*).cast(IntegerType))

or when function:

df
 .withColumn(
   "Selected_Group",
   when($"group".in((1 to 4).map(lit): _*), 1).otherwise(0))
like image 123
zero323 Avatar answered Feb 27 '26 02:02

zero323



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!