SPARK : Set a column value based on multiple row conditions

Question

I have a DataFrame in the following format:

+----+---+-----+------+-----+------+
|AGEF|SEX|F0_34|F35_44|M0_34|M35_44|
+----+---+-----+------+-----+------+
|  30|  0|    0|     0|    0|     0|
|  94|  1|    0|     0|    0|     0|
|  94|  0|    0|     0|    0|     0|
|  94|  0|    0|     0|    0|     0|
|  94|  1|    0|     0|    0|     0|
|  44|  0|    0|     0|    0|     0|
|  66|  0|    0|     0|    0|     0|
|  66|  0|    0|     0|    0|     0|
|  74|  0|    0|     0|    0|     0|
|  74|  0|    0|     0|    0|     0|
|  29|  0|    0|     0|    0|     0|
+----+---+-----+------+-----+------+

Based on the values of the columns AGEF and SEX, I need to set the corresponding indicator column to 1. The column names are self-explanatory: F0_34 means female aged 0 to 34, and likewise for the others. In the expected output, SEX = 0 corresponds to F.

Expected output:

+----+---+-----+------+-----+------+
|AGEF|SEX|F0_34|F35_44|M0_34|M35_44|
+----+---+-----+------+-----+------+
|  30|  0|    1|     0|    0|     0|
|  94|  1|    0|     0|    0|     0|
|  94|  0|    0|     0|    0|     0|
|  94|  0|    0|     0|    0|     0|
|  94|  1|    0|     0|    0|     0|
|  44|  0|    0|     1|    0|     0|
|  66|  0|    0|     0|    0|     0|
|  66|  0|    0|     0|    0|     0|
|  74|  0|    0|     0|    0|     0|
|  74|  0|    0|     0|    0|     0|
|  29|  0|    1|     0|    0|     0|
+----+---+-----+------+-----+------+

Thanks in Advance!!!

zero323 · Accepted Answer

Typically the most efficient approach is to build the new columns directly from SQL (Column) expressions, rather than processing rows with functions such as UDFs. For example:

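import org.apache.spark.sql.Column
// Assumes a SparkSession named `spark` is in scope (as in spark-shell); enables the $"..." syntax.
import spark.implicits._

// Build one boolean column per (age range, sex) pair.
// sexValues pairs the code stored in SEX with the label used in the column name.
def categorize(ageRanges: Seq[(Int, Int)], sexValues: Seq[(Int, String)]): Seq[Column] = for {
  (ageL, ageH) <- ageRanges
  (sexV, sexL) <- sexValues
} yield ($"SEX" === sexV && $"AGEF".between(ageL, ageH)).alias(  // compare against the code, not the label
  s"$sexL-$ageL-$ageH"
)

// Append the generated columns (named F-0-34, M-0-34, ...) to the existing ones.
df.select(
  $"*" +: categorize(Seq((0, 34), (35, 44)), Seq((0, "F"), (1, "M"))): _*
)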
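
To reproduce the exact layout from the question, one option is to cast each condition to an integer and name it after the target column. The following is only a sketch under a few assumptions: SEX = 0 denotes female (as the expected output implies), a local SparkSession is created for illustration, and the helper `indicators` and the sample DataFrame `people` are hypothetical names, not part of the question or the answer.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; reuse your existing one in practice.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("categorize-sketch")
  .getOrCreate()
import spark.implicits._

// A few rows from the question, without the placeholder indicator columns.
val people = Seq((30, 0), (94, 1), (44, 0), (29, 0)).toDF("AGEF", "SEX")

// One 0/1 column per (sex, age range) pair, named like the question's columns (e.g. F0_34).
// Iterating over sexValues first yields the order F0_34, F35_44, M0_34, M35_44.
def indicators(sexValues: Seq[(Int, String)], ageRanges: Seq[(Int, Int)]): Seq[Column] =
  for {
    (sexV, sexL) <- sexValues
    (ageL, ageH) <- ageRanges
  } yield (col("SEX") === sexV && col("AGEF").between(ageL, ageH)) // between is inclusive
    .cast("int")                                                   // true -> 1, false -> 0
    .alias(s"$sexL${ageL}_$ageH")

// Select only the key columns plus the regenerated indicators, so the
// result has no duplicate column names.
val result = people.select(
  col("AGEF") +: col("SEX") +: indicators(Seq((0, "F"), (1, "M")), Seq((0, 34), (35, 44))): _*
)
result.show()
```

If the input already carries all-zero placeholder columns with these names, selecting `$"*"` alongside the new columns would produce duplicate names, which is why this sketch selects only AGEF and SEX. Rows where AGEF or SEX is null would get null rather than 0 in the indicator columns.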

Tags: dataframe, apache-spark, apache-spark-sql

nareshbabral
