
PySpark Milliseconds of TimeStamp

Tags: pyspark

I am trying to get the difference between two timestamp columns, but the milliseconds are gone.

How can I correct this?

import pyspark.sql.functions as F

timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"

data = [
    (1, '2018-07-25 17:15:06.39', '2018-07-25 17:15:06.377'),
    (2, '2018-07-25 11:12:49.317', '2018-07-25 11:12:48.883')
]

df = spark.createDataFrame(data, ['ID', 'max_ts', 'min_ts']) \
    .withColumn('diff', F.unix_timestamp('max_ts', format=timeFmt)
                        - F.unix_timestamp('min_ts', format=timeFmt))
df.show(truncate=False)
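
On Spark 2.x this prints whole-second differences (0 and 1 here), with the milliseconds dropped:

+---+-----------------------+-----------------------+----+
|ID |max_ts                 |min_ts                 |diff|
+---+-----------------------+-----------------------+----+
|1  |2018-07-25 17:15:06.39 |2018-07-25 17:15:06.377|0   |
|2  |2018-07-25 11:12:49.317|2018-07-25 11:12:48.883|1   |
+---+-----------------------+-----------------------+----+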
Asked Mar 01 '19 by Keerikkattu Chellappan


3 Answers

Assuming you already have a dataframe with columns of timestamp type:

from datetime import datetime
import pyspark.sql.functions as F

data = [
    (1, datetime(2018, 7, 25, 17, 15, 6, 390000), datetime(2018, 7, 25, 17, 15, 6, 377000)),
    (2, datetime(2018, 7, 25, 11, 12, 49, 317000), datetime(2018, 7, 25, 11, 12, 48, 883000))
]

df = spark.createDataFrame(data, ['ID', 'max_ts','min_ts'])
df.printSchema()

# root
#  |-- ID: long (nullable = true)
#  |-- max_ts: timestamp (nullable = true)
#  |-- min_ts: timestamp (nullable = true)

You can get the time in seconds by casting the timestamp-type column to double, or in milliseconds by multiplying that result by 1000 (and optionally casting to long if you want an integer). For example:

df.select(
    F.col('max_ts').cast('double').alias('time_in_seconds'),
    (F.col('max_ts').cast('double') * 1000).cast('long').alias('time_in_milliseconds'),
).toPandas()

#     time_in_seconds  time_in_milliseconds
# 0    1532538906.390         1532538906390
# 1    1532517169.317         1532517169317

Finally, if you want the difference between the two times in milliseconds, you could do:

df.select(
    ((F.col('max_ts').cast('double') - F.col('min_ts').cast('double')) * 1000).cast('long').alias('diff_in_milliseconds'),
).toPandas()

#    diff_in_milliseconds
# 0                    13
# 1                   434

I'm doing this on PySpark 2.4.2. There is no need to resort to string manipulation at all.

Answered Sep 26 '22 by Troy


That's the intended behavior for unix_timestamp: its docstring clearly states that it only returns seconds, so the milliseconds component is dropped when doing the calculation.
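
A minimal sketch of that truncation, assuming an active SparkSession named spark (the exact epoch value depends on the session time zone):

import pyspark.sql.functions as F

df = spark.createDataFrame([('2018-07-25 17:15:06.390',)], ['ts'])
df.select(F.unix_timestamp('ts', "yyyy-MM-dd HH:mm:ss.SSS").alias('secs')).show()

# +----------+
# |      secs|
# +----------+
# |1532538906|
# +----------+
#
# Whole seconds only; the .390 is gone.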

If you still want that calculation, you can use the substring function to extract the milliseconds and add them back before taking the difference. See the example below. Please note that this assumes fully formed data, i.e. the milliseconds are always present with all 3 digits:

import pyspark.sql.functions as F

timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"
data = [
    (1, '2018-07-25 17:15:06.390', '2018-07-25 17:15:06.377'),  # note the '390'
    (2, '2018-07-25 11:12:49.317', '2018-07-25 11:12:48.883')
]

df = spark.createDataFrame(data, ['ID', 'max_ts', 'min_ts'])\
    .withColumn('max_milli', F.unix_timestamp('max_ts', format=timeFmt) + F.substring('max_ts', -3, 3).cast('float')/1000)\
    .withColumn('min_milli', F.unix_timestamp('min_ts', format=timeFmt) + F.substring('min_ts', -3, 3).cast('float')/1000)\
    .withColumn('diff', (F.col('max_milli') - F.col('min_milli')).cast('float') * 1000)

df.show(truncate=False)

+---+-----------------------+-----------------------+----------------+----------------+---------+
|ID |max_ts                 |min_ts                 |max_milli       |min_milli       |diff     |
+---+-----------------------+-----------------------+----------------+----------------+---------+
|1  |2018-07-25 17:15:06.390|2018-07-25 17:15:06.377|1.53255330639E9 |1.532553306377E9|13.000011|
|2  |2018-07-25 11:12:49.317|2018-07-25 11:12:48.883|1.532531569317E9|1.532531568883E9|434.0    |
+---+-----------------------+-----------------------+----------------+----------------+---------+
Answered Sep 26 '22 by Tanjin


The answer from Tanjin doesn't work when the values are of type timestamp and the milliseconds are round numbers (like 390 or 500). Spark cuts the trailing 0 when rendering the timestamp as a string, so the timestamp from the example would look like 2018-07-25 17:15:06.39.

The problem is the hardcoded length in F.substring('max_ts', -3, 3): if the trailing 0 is missing, the substring picks up the decimal point and the millisecond value is wrong.
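
A quick illustration of that failure mode, using a hypothetical one-column DataFrame (a string stands in for the rendered timestamp):

import pyspark.sql.functions as F

df = spark.createDataFrame([('2018-07-25 17:15:06.39',)], ['max_ts'])
df.select(F.substring('max_ts', -3, 3).alias('last3')).show()

# +-----+
# |last3|
# +-----+
# |  .39|
# +-----+
#
# '.39' instead of '390', so the millisecond arithmetic goes wrong.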

To convert tmpColumn, a column of type timestamp, to tmpLongColumn of type long, I used this snippet:

import pyspark.sql.functions as F

timeFmt = "yyyy-MM-dd HH:mm:ss.SSS"

df = df \
  .withColumn('tmpLongColumn', F.substring_index('tmpColumn', '.', -1).cast('float')) \
  .withColumn('tmpLongColumn', F.when(F.col('tmpLongColumn') < 100, F.col('tmpLongColumn')*10).otherwise(F.col('tmpLongColumn')).cast('long')) \
  .withColumn('tmpLongColumn', F.unix_timestamp('tmpColumn', format=timeFmt)*1000 + F.col('tmpLongColumn'))

The first transformation extracts the substring containing the milliseconds. Next, if the value is less than 100, it is multiplied by 10 to compensate for the trailing zero that was cut off. Finally, the timestamp is converted to epoch milliseconds and the millisecond part is added.
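
For example, applied to the question's data (a sketch; the hypothetical tmpColumn stands in for one of the timestamp columns):

import pyspark.sql.functions as F

data = [(1, '2018-07-25 17:15:06.39'), (2, '2018-07-25 11:12:49.317')]
df = spark.createDataFrame(data, ['ID', 'tmpColumn']) \
    .withColumn('tmpColumn', F.col('tmpColumn').cast('timestamp'))

# After applying the snippet above, tmpLongColumn holds epoch milliseconds:
# 1532538906390 and 1532517169317, matching the values in Troy's answer.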

Answered Sep 24 '22 by kaichi