I have a Spark DataFrame that consists of a series of dates:
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import *
import pandas as pd

sqlContext = SQLContext(sc)

rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876','sip:4534454450'),
                      ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321','sip:6413445440'),
                      ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229','sip:4534437492'),
                      ('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881','sip:6474454453'),
                      ('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323','sip:8874458555')])

schema = StructType([StructField('ID', StringType(), True),
                     StructField('EndDateTime', StringType(), True),
                     StructField('StartDateTime', StringType(), True),
                     StructField('ANI', StringType(), True)])

df = sqlContext.createDataFrame(rdd, schema)
What I want to do is find the duration by subtracting StartDateTime from EndDateTime. I figured I'd try and do this using a function:
# Function to calculate time delta
def time_delta(y, x):
    end = pd.to_datetime(y)
    start = pd.to_datetime(x)
    delta = (end - start)
    return delta

# add new column 'Duration' to a new DataFrame by applying the time_delta function
df2 = df.withColumn('Duration', time_delta(df.EndDateTime, df.StartDateTime))
However this just gives me:
>>> df2.show()
ID  EndDateTime           StartDateTime         ANI             Duration
X01 2014-02-13T12:36:...  2014-02-13T12:31:...  sip:4534454450  null
X02 2014-02-13T12:35:...  2014-02-13T12:32:...  sip:6413445440  null
X03 2014-02-13T12:36:...  2014-02-13T12:32:...  sip:4534437492  null
XO4 2014-02-13T12:37:...  2014-02-13T12:32:...  sip:6474454453  null
XO5 2014-02-13T12:36:...  2014-02-13T12:33:...  sip:8874458555  null
I'm not sure if my approach is correct or not. If not, I'd gladly accept another suggested way to achieve this.
As of Spark 1.5 you can use unix_timestamp:
from pyspark.sql import functions as F

timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
timeDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)
            - F.unix_timestamp('StartDateTime', format=timeFmt))
df = df.withColumn("Duration", timeDiff)
Note the Java-style time format.
>>> df.show()
+---+--------------------+--------------------+--------+
| ID|         EndDateTime|       StartDateTime|Duration|
+---+--------------------+--------------------+--------+
|X01|2014-02-13T12:36:...|2014-02-13T12:31:...|     258|
|X02|2014-02-13T12:35:...|2014-02-13T12:32:...|     204|
|X03|2014-02-13T12:36:...|2014-02-13T12:32:...|     228|
|XO4|2014-02-13T12:37:...|2014-02-13T12:32:...|     269|
|XO5|2014-02-13T12:36:...|2014-02-13T12:33:...|     202|
+---+--------------------+--------------------+--------+
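Note that unix_timestamp works in whole seconds, so any millisecond part is dropped. If you also want the duration in minutes, one option is to reuse the same seconds difference and divide by 60 before adding the column. This is only a minimal sketch using the same DataFrame and time format as above; the DurationSeconds and DurationMinutes column names are just illustrative:

from pyspark.sql import functions as F

timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
secondsDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)
               - F.unix_timestamp('StartDateTime', format=timeFmt))

# duration in whole seconds, plus a derived (fractional) minutes column
df = (df.withColumn('DurationSeconds', secondsDiff)
        .withColumn('DurationMinutes', secondsDiff / 60))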
Thanks to David Griffin. Here's how to do this for future reference.
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
from pyspark.sql.functions import udf

sqlContext = SQLContext(sc)

# Build sample data
rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876'),
                      ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321'),
                      ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229'),
                      ('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881'),
                      ('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323')])

schema = StructType([StructField('ID', StringType(), True),
                     StructField('EndDateTime', StringType(), True),
                     StructField('StartDateTime', StringType(), True)])

df = sqlContext.createDataFrame(rdd, schema)

# define timedelta function (obtain duration in seconds)
def time_delta(y, x):
    from datetime import datetime
    end = datetime.strptime(y, '%Y-%m-%dT%H:%M:%S.%f')
    start = datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%f')
    delta = int((end - start).total_seconds())  # truncate to whole seconds to match IntegerType
    return delta

# register as a UDF
f = udf(time_delta, IntegerType())

# Apply function
df2 = df.withColumn('Duration', f(df.EndDateTime, df.StartDateTime))
Applying time_delta() will give you the duration in seconds:
>>> df2.show()
ID  EndDateTime           StartDateTime         Duration
X01 2014-02-13T12:36:...  2014-02-13T12:31:...  258
X02 2014-02-13T12:35:...  2014-02-13T12:32:...  204
X03 2014-02-13T12:36:...  2014-02-13T12:32:...  228
XO4 2014-02-13T12:37:...  2014-02-13T12:32:...  268
XO5 2014-02-13T12:36:...  2014-02-13T12:33:...  202
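If millisecond precision matters, total_seconds() already returns a float, so one variation (a small sketch under the same setup, not part of the original answer; time_delta_float, f_float and df3 are just illustrative names) is to register the UDF with DoubleType and keep the fractional part:

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf
from datetime import datetime

def time_delta_float(y, x):
    # same parsing as time_delta above, but keep the fractional seconds
    end = datetime.strptime(y, '%Y-%m-%dT%H:%M:%S.%f')
    start = datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%f')
    return (end - start).total_seconds()

f_float = udf(time_delta_float, DoubleType())
df3 = df.withColumn('Duration', f_float(df.EndDateTime, df.StartDateTime))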