Filling missing dates in spark dataframe column

Tags: datetime, scala, apache-spark, apache-spark-sql

I have a Spark DataFrame with two columns: "date" of type timestamp and "quantity" of type long. Each date has a quantity value, and the dates are sorted in increasing order, but some dates are missing. For example, the current df is:

Date        |    Quantity
10-09-2016  |    1
11-09-2016  |    2
14-09-2016  |    0
16-09-2016  |    1
17-09-2016  |    0
20-09-2016  |    2

As you can see, the df is missing some dates, such as 12-09-2016 and 13-09-2016. I want to add those missing dates with 0 in the quantity field, so that the resulting df looks like this:

Date        |    Quantity
10-09-2016  |    1
11-09-2016  |    2
12-09-2016  |    0
13-09-2016  |    0
14-09-2016  |    0
15-09-2016  |    0
16-09-2016  |    1
17-09-2016  |    0
18-09-2016  |    0
19-09-2016  |    0
20-09-2016  |    2

Any help or suggestions would be appreciated. Thanks in advance. Note that I am coding in Scala.

asked Oct 12 '17 12:10

Ishan

2 Answers

I have written this answer in a somewhat verbose way so that the code is easy to follow. It can be optimized.

Needed imports

import java.time.format.DateTimeFormatter
import java.time.{LocalDate, LocalDateTime}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, TimestampType}
// Needed for toDF and the $"col" syntax outside spark-shell;
// assumes a SparkSession value named `spark` is in scope.
import spark.implicits._

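UDF to convert a "dd-MM-yyyy" string into the "yyyy-MM-dd" format

val date_transform = udf((date: String) => {
  // "d-M-y" accepts one- or two-digit days and months, e.g. "10-09-2016"
  val dtFormatter = DateTimeFormatter.ofPattern("d-M-y")
  val dt = LocalDate.parse(date, dtFormatter)
  // Zero-padded yyyy-MM-dd, which Spark can cast to a timestamp
  "%04d-%02d-%02d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
})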

The UDF below is taken from "Iterate over dates range"

val fill_dates = udf((start: String, excludedDiff: Int) => {
  // The timestamp column is passed in as a "yyyy-MM-dd HH:mm:ss" string
  val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val fromDt = LocalDateTime.parse(start, dtFormatter)
  // Generate the days strictly between `start` and the next existing date
  (1 until excludedDiff).map(day => {
    val dt = fromDt.plusDays(day)
    "%04d-%02d-%02d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
  })
})

Setting up sample dataframe (df)

val df = Seq(
      ("10-09-2016", 1),
      ("11-09-2016", 2),
      ("14-09-2016", 0),
      ("16-09-2016", 1),
      ("17-09-2016", 0),
      ("20-09-2016", 2)).toDF("date", "quantity")
      .withColumn("date", date_transform($"date").cast(TimestampType))
      .withColumn("quantity", $"quantity".cast(LongType))

df.printSchema()
root
 |-- date: timestamp (nullable = true)
 |-- quantity: long (nullable = false)


df.show()    
+-------------------+--------+
|               date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00|       1|
|2016-09-11 00:00:00|       2|
|2016-09-14 00:00:00|       0|
|2016-09-16 00:00:00|       1|
|2016-09-17 00:00:00|       0|
|2016-09-20 00:00:00|       2|
+-------------------+--------+

Create a temporary dataframe (tempDf) to union with df:

val w = Window.orderBy($"date")
val tempDf = df.withColumn("diff", datediff(lead($"date", 1).over(w), $"date"))
  .filter($"diff" > 1) // Keep only rows followed by a gap of more than one day
  .withColumn("next_dates", fill_dates($"date", $"diff"))
  .withColumn("quantity", lit(0L)) // Long literal, so the type matches df's quantity column
  .withColumn("date", explode($"next_dates"))
  .withColumn("date", $"date".cast(TimestampType))

tempDf.show(false)
+-------------------+--------+----+------------------------+
|date               |quantity|diff|next_dates              |
+-------------------+--------+----+------------------------+
|2016-09-12 00:00:00|0       |3   |[2016-09-12, 2016-09-13]|
|2016-09-13 00:00:00|0       |3   |[2016-09-12, 2016-09-13]|
|2016-09-15 00:00:00|0       |2   |[2016-09-15]            |
|2016-09-18 00:00:00|0       |3   |[2016-09-18, 2016-09-19]|
|2016-09-19 00:00:00|0       |3   |[2016-09-18, 2016-09-19]|
+-------------------+--------+----+------------------------+
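
The gap-detection step can also be checked locally without Spark. The following is a minimal plain-Python sketch, not part of the answer's Scala code; the name `find_gaps` is hypothetical. It mirrors the role of `datediff(lead(date), date)` followed by the `diff > 1` filter on the sample dates from the question.

```python
from datetime import date

# Sample dates from the question, already sorted in increasing order.
dates = [date(2016, 9, 10), date(2016, 9, 11), date(2016, 9, 14),
         date(2016, 9, 16), date(2016, 9, 17), date(2016, 9, 20)]

def find_gaps(sorted_dates):
    """Return (date, diff) pairs where the next date is more than one day away.

    Hypothetical helper for illustration; it plays the same role as
    datediff(lead(date), date) plus the diff > 1 filter.
    """
    gaps = []
    for current, following in zip(sorted_dates, sorted_dates[1:]):
        diff = (following - current).days
        if diff > 1:
            gaps.append((current, diff))
    return gaps

gaps = find_gaps(dates)
for start, diff in gaps:
    print(start, diff)
# prints:
# 2016-09-11 3
# 2016-09-14 2
# 2016-09-17 3
```

These three rows are the ones that survive the filter; each is then expanded into diff - 1 new rows, which gives the five rows shown in tempDf.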

Now union the two dataframes

val result = df.union(tempDf.select("date", "quantity"))
  .orderBy("date")

result.show()
+-------------------+--------+
|               date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00|       1|
|2016-09-11 00:00:00|       2|
|2016-09-12 00:00:00|       0|
|2016-09-13 00:00:00|       0|
|2016-09-14 00:00:00|       0|
|2016-09-15 00:00:00|       0|
|2016-09-16 00:00:00|       1|
|2016-09-17 00:00:00|       0|
|2016-09-18 00:00:00|       0|
|2016-09-19 00:00:00|       0|
|2016-09-20 00:00:00|       2|
+-------------------+--------+

answered Sep 23 '22 21:09

mrsrinivas

Based on @mrsrinivas's excellent answer, here is a PySpark version.

Needed imports

from typing import List
import datetime
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import col, lit, udf, datediff, lead, explode
from pyspark.sql.types import DateType, ArrayType

UDF to create the range of next dates

def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
    return [start_date + datetime.timedelta(days=days) for days in range(1, diff)]
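
As a quick sanity check, the helper can be run on its own before it is wrapped in a UDF. The sketch below repeats the same body under the hypothetical name `get_next_dates` so that it runs standalone, and uses a gap taken from the question's sample data.

```python
import datetime
from typing import List

def get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
    # Same body as _get_next_dates above, repeated so this snippet runs on its own.
    return [start_date + datetime.timedelta(days=days) for days in range(1, diff)]

# 11-09-2016 is followed by 14-09-2016 in the sample data, so diff is 3.
filled = get_next_dates(datetime.date(2016, 9, 11), 3)
print(filled)
# prints: [datetime.date(2016, 9, 12), datetime.date(2016, 9, 13)]

# Consecutive dates (diff of 1) produce no fill dates.
no_gap = get_next_dates(datetime.date(2016, 9, 10), 1)
print(no_gap)
# prints: []
```

The range starts at 1 and stops before diff, so neither the start date nor the next existing date is generated again.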

Function to create the DataFrame that fills in the dates (supports "grouping" columns):

def _get_fill_dates_df(df: DataFrame, date_column: str, group_columns: List[str], fill_column: str) -> DataFrame:
    get_next_dates_udf = udf(_get_next_dates, ArrayType(DateType()))

    # Partition by the grouping columns so gaps are measured within each group only
    window = Window.partitionBy(*group_columns).orderBy(date_column)

    return df.withColumn("_diff", datediff(lead(date_column, 1).over(window), date_column)) \
        .filter(col("_diff") > 1).withColumn("_next_dates", get_next_dates_udf(date_column, "_diff")) \
        .withColumn(fill_column, lit(0)).withColumn(date_column, explode("_next_dates")) \
        .drop("_diff", "_next_dates")

The usage of the function:

fill_df = _get_fill_dates_df(df, "Date", [], "Quantity")
df = df.union(fill_df)

This assumes that the date column is already of date type.
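
To validate the Spark output on a small sample, it can help to compare it against a plain-Python reference. The sketch below does not use the lead/datediff technique; it simply walks every day between the first and last date. The name `fill_missing_dates` and the list-of-tuples input are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import List, Tuple

def fill_missing_dates(rows: List[Tuple[date, int]]) -> List[Tuple[date, int]]:
    """Return rows sorted by date, with a (date, 0) row for every missing day.

    Hypothetical local reference for small samples, not a Spark implementation.
    """
    if not rows:
        return []
    known = dict(rows)
    first, last = min(known), max(known)
    result = []
    for offset in range((last - first).days + 1):
        day = first + timedelta(days=offset)
        # Keep the existing quantity, or use 0 for a date that was missing.
        result.append((day, known.get(day, 0)))
    return result

sample = [(date(2016, 9, 10), 1), (date(2016, 9, 11), 2), (date(2016, 9, 14), 0),
          (date(2016, 9, 16), 1), (date(2016, 9, 17), 0), (date(2016, 9, 20), 2)]

filled = fill_missing_dates(sample)
print([quantity for _, quantity in filled])
# prints: [1, 2, 0, 0, 0, 0, 1, 0, 0, 0, 2]
```

The quantities match the expected result in the question, so collecting the Spark result for the same sample and comparing it with this list is a cheap regression check.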

answered Sep 25 '22 21:09

Fernando Camargo
