I have a dataframe of the below format:
+----+---+-----+------+-----+------+
|AGEF|SEX|F0_34|F35_44|M0_34|M35_44|
+----+---+-----+------+-----+------+
| 30| 0| 0| 0| 0| 0|
| 94| 1| 0| 0| 0| 0|
| 94| 0| 0| 0| 0| 0|
| 94| 0| 0| 0| 0| 0|
| 94| 1| 0| 0| 0| 0|
| 44| 0| 0| 0| 0| 0|
| 66| 0| 0| 0| 0| 0|
| 66| 0| 0| 0| 0| 0|
| 74| 0| 0| 0| 0| 0|
| 74| 0| 0| 0| 0| 0|
| 29| 0| 0| 0| 0| 0|
Now based on the values of columns AGEF and SEX I need to assign 1 to corresponding column name. Each column name is self explanatory like F0_34 is female between age 0 to 34 similarly for other case.
Expected output is
+----+---+-----+------+-----+------+
|AGEF|SEX|F0_34|F35_44|M0_34|M35_44|
+----+---+-----+------+-----+------+
| 30| 0| 1| 0| 0| 0|
| 94| 1| 0| 0| 0| 0|
| 94| 0| 0| 0| 0| 0|
| 94| 0| 0| 0| 0| 0|
| 94| 1| 0| 0| 0| 0|
| 44| 0| 0| 1| 0| 0|
| 66| 0| 0| 0| 0| 0|
| 66| 0| 0| 0| 0| 0|
| 74| 0| 0| 0| 0| 0|
| 74| 0| 0| 0| 0| 0|
| 29| 0| 1| 0| 0| 0|
Thanks in Advance!!!
Typically the most efficient approach is to operate directly on SQL expressions. For example:
def categorize(ageRanges: Seq[(Int, Int)], sexValues: Seq[(Int, String)]) = for {
(ageL, ageH) <- ageRanges
(sexV, sexL) <- sexValues
} yield ($"SEX" === sexL && $"AGEF".between(ageL, ageH)).alias(
s"$sexL-$ageL-$ageH"
)
df.select(
$"*" +: categorize(Seq((0, 34), (35, 44)), Seq((0, "F"), (1, "M"))): _*
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With