Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fill null values in dataframe column with next value

I have to fill the first null values with immediate value of the same column in dataframe. This logic applies only on first consecutive null values only of the column.

I have a dataframe with similar to below

 //I replaced null to 0 in value column
 val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
               (5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
               .toDF("value", "col2", "col3")

scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0    |exA |30  |
|0    |exB |22  |
|0    |exC |19  |
|16   |exD |13  |
|5    |exE |28  |
|6    |exF |26  |
|0    |exG |12  |
|13   |exH |53  |
+-----+----+----+

From this dataframe I am expecting as below

scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16   |exA |30  |    // Change the value 0 to 16 at value column
|16   |exB |22  |    // Change the value 0 to 16 at value column
|16   |exC |19  |    // Change the value 0 to 16 at value column
|16   |exD |13  |
|5    |exE |28  |
|6    |exF |26  |
|0    |exG |12  |    // value should not be change here
|13   |exH |53  |
+-----+----+----+

Please help me solve this.

like image 355
Rao Avatar asked Dec 06 '25 06:12

Rao


1 Answers

You can use Window function for this purpose

 val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
           (5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
           .toDF("value", "col2", "col3")
 val w = Window.orderBy($"col2".desc)
 df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
  .orderBy($"col2")
  .show(10)

Will result in

+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
|    0| exA|  30|    16|
|    0| exB|  22|    16|
|    0| exC|  19|    16|
|   16| exD|  13|    16|
|    5| exE|  28|     5|
|    6| exF|  26|     6|
|    0| exG|  12|    13|
|   13| exH|  53|    13|
+-----+----+----+------+

Expression df.orderBy($"col2") is needed only to show final results in right order. You can skip it if you don't care about final order.

UPDATE To get exactly what you need you should you a little bit more complicated code

val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)
df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
  .withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
  .orderBy($"col2")
    .show(10)

+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
|    0| exA|  30|              null|    16|
|    0| exB|  22|              null|    16|
|    0| exC|  19|              null|    16|
|   16| exD|  13|                16|    16|
|    5| exE|  28|                16|     5|
|    6| exF|  26|                16|     6|
|    0| exG|  12|                16|     0|
|   13| exH|  53|                16|    13|
+-----+----+----+------------------+------+
like image 139
Avseiytsev Dmitriy Avatar answered Dec 08 '25 19:12

Avseiytsev Dmitriy



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!