
Is it possible to ignore null values when using lead window function in Spark

My DataFrame looks like this:

id  value  date    
1   100    2017 
1   null   2016 
1   20     2015 
1   100    2014

I would like to get the most recent previous value, but ignoring nulls:

id  value  date   recent value
1   100    2017    20
1   null   2016    20
1   20     2015   100
1   100    2014   null

Is there any way to ignore null values while using the lead window function?

John asked Feb 09 '18 13:02

1 Answer

Is it possible to ignore null values when using lead window function in Spark

It is not.

I would like to get the most recent value, but ignoring nulls

Just use last (or first) with ignoreNulls:

def last(columnName: String, ignoreNulls: Boolean): Column

Aggregate function: returns the last value of the column in a group.

The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._  // required for toDF on a local Seq

val df = Seq(
  (1, Some(100), 2017), (1, None, 2016), (1, Some(20), 2015),
  (1, Some(100), 2014)
).toDF("id", "value", "date")

df.withColumn(
  "last_value",
  // last with ignoreNulls = true returns the most recent non-null
  // value within the window frame (default: unbounded preceding
  // through the current row)
  last("value", ignoreNulls = true).over(Window.partitionBy("id").orderBy("date"))
).show

+---+-----+----+----------+                                                     
| id|value|date|last_value|
+---+-----+----+----------+
|  1|  100|2014|       100|
|  1|   20|2015|        20|
|  1| null|2016|        20|
|  1|  100|2017|       100|
+---+-----+----+----------+
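Note that the default frame above runs up to and including the current row, so a non-null row sees its own value. To reproduce the question's expected "recent value" column, which looks only at strictly earlier dates, a sketch (assuming the same df and imports as above) could restrict the frame to end one row before the current one:

```scala
// Sketch: frame ends at -1, i.e. the row before the current one,
// so each row sees only strictly earlier dates within its id group.
val w = Window
  .partitionBy("id")
  .orderBy("date")
  .rowsBetween(Window.unboundedPreceding, -1)

df.withColumn("recent_value", last("value", ignoreNulls = true).over(w)).show
```

With the sample data this yields null for 2014 (no earlier rows), 100 for 2015, and 20 for both 2016 and 2017, matching the table in the question.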
Alper t. Turker answered Sep 22 '22 18:09