
Get all the dates between two dates in a Spark DataFrame

I have a DF in which I have bookingDt and arrivalDt columns. I need to find all the dates between these two dates.

Sample code:

from pyspark.sql import Row
from pyspark.sql.functions import datediff

df = spark.sparkContext.parallelize(
            [Row(vyge_id=1000, bookingDt='2018-01-01', arrivalDt='2018-01-05')]).toDF()
diffDaysDF = df.withColumn("diffDays", datediff('arrivalDt', 'bookingDt'))
diffDaysDF.show()

code output:

+----------+----------+-------+--------+
| arrivalDt| bookingDt|vyge_id|diffDays|
+----------+----------+-------+--------+
|2018-01-05|2018-01-01|   1000|       4|
+----------+----------+-------+--------+

I tried computing the number of days between the two dates, generating every date with timedelta, and then exploding the result:

# range(diffDays + 1) so that arrivalDt itself is included
dateList = [str(bookingDt + timedelta(i)) for i in range(diffDays + 1)]

Expected output:

Basically, I need to build a DF with a record for each date in between bookingDt and arrivalDt, inclusive.

+----------+----------+-------+----------+
| arrivalDt| bookingDt|vyge_id|     txnDt|
+----------+----------+-------+----------+
|2018-01-05|2018-01-01|   1000|2018-01-01|
|2018-01-05|2018-01-01|   1000|2018-01-02|
|2018-01-05|2018-01-01|   1000|2018-01-03|
|2018-01-05|2018-01-01|   1000|2018-01-04|
|2018-01-05|2018-01-01|   1000|2018-01-05|
+----------+----------+-------+----------+
Asked by Shankar, Aug 08 '18 at 10:08



1 Answer

For Spark 2.4+, sequence can be used to create an array containing all dates between bookingDt and arrivalDt, inclusive. This array can then be exploded into one row per date.

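from pyspark.sql import functions as F

# sequence() needs date (or timestamp) bounds, so cast the string columns first
df = df \
  .withColumn('bookingDt', F.col('bookingDt').cast('date')) \
  .withColumn('arrivalDt', F.col('arrivalDt').cast('date'))

# build the inclusive array of dates, then explode it into one row per date
df.withColumn('txnDt', F.explode(F.expr('sequence(bookingDt, arrivalDt, interval 1 day)')))\
  .show()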

Output:

+-------+----------+----------+----------+
|vyge_id| bookingDt| arrivalDt|     txnDt|
+-------+----------+----------+----------+
|   1000|2018-01-01|2018-01-05|2018-01-01|
|   1000|2018-01-01|2018-01-05|2018-01-02|
|   1000|2018-01-01|2018-01-05|2018-01-03|
|   1000|2018-01-01|2018-01-05|2018-01-04|
|   1000|2018-01-01|2018-01-05|2018-01-05|
+-------+----------+----------+----------+
Answered by werner, Sep 29 '22 at 18:09
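
For Spark versions older than 2.4, where sequence is not available, the timedelta idea from the question can still work. The following is only a minimal sketch and not part of the answer above: dates_between is a hypothetical helper name, and returning an empty list for missing or reversed bounds is an assumption made here.

```python
from datetime import date, timedelta


def dates_between(start, end):
    """Return every date from start to end, inclusive (hypothetical helper)."""
    # Assumption of this sketch: missing bounds or a reversed range
    # produce an empty list instead of raising an error.
    if start is None or end is None or end < start:
        return []
    n_days = (end - start).days
    # n_days + 1 so that the end date itself is part of the result
    return [start + timedelta(days=i) for i in range(n_days + 1)]
```

One way to apply it to the DataFrame is to wrap it as F.udf(dates_between, ArrayType(DateType())), call it on the two columns after casting them to date, and pass the result to F.explode. Note that explode produces no rows for an empty array, and that a Python UDF is generally slower than the built-in sequence, so this is probably only worth considering when sequence cannot be used.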