
Spark: Set a column value based on multiple row conditions

I have a dataframe of the below format:

+----+---+-----+------+-----+------+
|AGEF|SEX|F0_34|F35_44|M0_34|M35_44|
+----+---+-----+------+-----+------+
|  30|  0|    0|     0|    0|     0|
|  94|  1|    0|     0|    0|     0|
|  94|  0|    0|     0|    0|     0|
|  94|  0|    0|     0|    0|     0|
|  94|  1|    0|     0|    0|     0|
|  44|  0|    0|     0|    0|     0|
|  66|  0|    0|     0|    0|     0|
|  66|  0|    0|     0|    0|     0|
|  74|  0|    0|     0|    0|     0|
|  74|  0|    0|     0|    0|     0|
|  29|  0|    0|     0|    0|     0|
+----+---+-----+------+-----+------+

Based on the values of the AGEF and SEX columns, I need to assign 1 to the corresponding column. The column names are self-explanatory: F0_34 means female aged 0 to 34, and similarly for the other columns.

Expected output is

+----+---+-----+------+-----+------+
|AGEF|SEX|F0_34|F35_44|M0_34|M35_44|
+----+---+-----+------+-----+------+
|  30|  0|    1|     0|    0|     0|
|  94|  1|    0|     0|    0|     0|
|  94|  0|    0|     0|    0|     0|
|  94|  0|    0|     0|    0|     0|
|  94|  1|    0|     0|    0|     0|
|  44|  0|    0|     1|    0|     0|
|  66|  0|    0|     0|    0|     0|
|  66|  0|    0|     0|    0|     0|
|  74|  0|    0|     0|    0|     0|
|  74|  0|    0|     0|    0|     0|
|  29|  0|    1|     0|    0|     0|
+----+---+-----+------+-----+------+

Thanks in advance!

nareshbabral asked Jan 07 '23 07:01


1 Answer

Typically the most efficient approach is to operate directly on SQL column expressions rather than a UDF, so the optimizer can see the whole computation. For example:

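import spark.implicits._ // enables $"..." (spark is the active SparkSession)

// Build one 0/1 flag column per (sex, age range) pair, named e.g. F0_34.
// SEX is compared with the numeric code (sexV), not the label (sexL).
def categorize(ageRanges: Seq[(Int, Int)], sexValues: Seq[(Int, String)]) = for {
  (sexV, sexL) <- sexValues
  (ageL, ageH) <- ageRanges
} yield ($"SEX" === sexV && $"AGEF".between(ageL, ageH))
  .cast("int")
  .alias(s"${sexL}${ageL}_${ageH}")

// Select AGEF and SEX explicitly: $"*" would also keep the all-zero placeholder
// columns and produce duplicate column names.
df.select(
  $"AGEF" +: $"SEX" +: categorize(Seq((0, 34), (35, 44)), Seq((0, "F"), (1, "M"))): _*
)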
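
If the flag columns already exist in the DataFrame, as in the question, another option is to overwrite them in place with when/otherwise and withColumn. The following is only a sketch under some assumptions: the object name FlagColumns and the helper flag are made up for illustration, SEX 0 is taken to mean female and 1 male (inferred from the expected output), and the age bounds are treated as inclusive, which is how between behaves.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Hypothetical helper object, not part of Spark.
object FlagColumns {
  // Assumption inferred from the expected output: SEX 0 = female ("F"), 1 = male ("M").
  val sexLabels: Seq[(Int, String)] = Seq((0, "F"), (1, "M"))
  val ageRanges: Seq[(Int, Int)] = Seq((0, 34), (35, 44))

  // For every (sex, age range) pair, overwrite the column named e.g. "F0_34"
  // with 1 where the row matches and 0 otherwise. withColumn replaces a column
  // that already exists under that name, so the column order is preserved.
  def flag(df: DataFrame): DataFrame = {
    val combos = for {
      (sexV, sexL) <- sexLabels
      (ageL, ageH) <- ageRanges
    } yield (sexV, s"${sexL}${ageL}_${ageH}", ageL, ageH)

    combos.foldLeft(df) { case (acc, (sexV, name, ageL, ageH)) =>
      val matches = col("SEX") === sexV && col("AGEF").between(ageL, ageH)
      acc.withColumn(name, when(matches, 1).otherwise(0))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("flag-columns").getOrCreate()
    import spark.implicits._

    // A few rows from the question, with the flag columns still all zero.
    val df = Seq(
      (30, 0, 0, 0, 0, 0),
      (94, 1, 0, 0, 0, 0),
      (44, 0, 0, 0, 0, 0)
    ).toDF("AGEF", "SEX", "F0_34", "F35_44", "M0_34", "M35_44")

    flag(df).show()
    spark.stop()
  }
}
```

Folding over the combinations keeps the rule in one place, so adding another age range or sex code only means extending the two sequences. For a large number of flag columns a single select is likely the better fit, since each withColumn call adds another projection to the plan.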
zero323 answered Jan 08 '23 22:01