Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Map column values to a a numeric type in spark

I have a df in spark which as the following structure:

amount gender status
1000   male   married
1313   female single
1000   male   married

Basically i want to create new column where gender is a number

amount gender status  gender_num
1000   male   married 1
1313   female single  2
1000   male   married 1

I tired the following:

  val gender = df.gender

  val gender_num = gender match {
case male => 1
case female => 2
}

I get the following error:

<console>:125: error: value pa_gender_category is not a member of org.apache.spark.sql.DataFrame
val gender = data.pa_gender_category

I know there is a stringtoindex function, but i would like to do this manually

like image 265
Anubhav Dikshit Avatar asked Apr 24 '17 12:04

Anubhav Dikshit


2 Answers

Use withColumn

val input = // load input DataFrame
val withGender = input.withColumn("gender_num", when($"gender" === "female", 2).otherwise(1))

You can chain more options:

val withGender = input.withColumn("gender_num", when($"gender" === "female", 2).when($"gender" == "other", 3).otherwise(1))

You can also use UDF like in Akash's answer. Be aware, that sometimes UDFs cannot be optimized as much as built-in functions, but they can be more readable

like image 191
T. Gawęda Avatar answered Oct 08 '22 15:10

T. Gawęda


You can use UDF of Spark

import org.apache.spark.sql.functions.udf
def genderToNumber: UserDefinedFunction = {
    udf((gender: String) => {
                             gender match {
                                           case "male" => 1
                                           case "female" => 2
                                          }
                          }               })

You can apply UDF by this

   val newDF = df.withColumn("gender_num", genderToNumber(df("gender")))
like image 23
Akash Sethi Avatar answered Oct 08 '22 14:10

Akash Sethi