I am trying to add a new column to an existing data frame using the withColumn statement in Spark Dataframe API. The below code works but I was wondering if there's a way that I can select more than one group. Let's say Group 1, 2, 3, 4 instead of only Group 1. I think I may be able to write when statement four times. I have seen people do that in some posts. However, in R, there is a %in% operator that can specify if a variable contains values in a vector, but I don't know if there's such thing in Spark. I checked on the Spark API documentation but most of the functions don't contain any examples.
# R Sample Code:
library(dplyr)
df1 <- df %>% mutate( Selected_Group = (Group %in% 1:4))
Spark Dataframe Sample Code That Selects Group 1:
val df1 = df.withColumn("Selected_Group", when($"Group" === 1, 1).otherwise(0))
Data
ID, Group
1, 0
2, 1
3, 2
. .
. .
100, 99
With UDF:
import org.apache.spark.sql.functions.udf
def in(s: Set[Int]) = udf((x: Int) => if (s.contains(x)) 1 else 0)
df.withColumn("Selected_Group", in((1 to 4).toSet)($"group"))
With raw SQL:
df.registerTempTable("df")
sqlContext.sql(
"SELECT *, CAST(group IN (1, 2, 3, 4) AS INT) AS Selected_Group FROM df"
)
With Column.in method:
import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.sql.types.IntegerType
df.withColumn(
"Selected_Group",
$"group".in((1 to 4).map(lit): _*).cast(IntegerType))
or when function:
df
.withColumn(
"Selected_Group",
when($"group".in((1 to 4).map(lit): _*), 1).otherwise(0))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With