PySpark Milliseconds of TimeStamp

Question

I am trying to get the difference between two timestamp columns, but the milliseconds are lost.

How can I correct this?

import pyspark.sql.functions as F

timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"

data = [
    (1, '2018-07-25 17:15:06.39', '2018-07-25 17:15:06.377'),
    (2, '2018-07-25 11:12:49.317', '2018-07-25 11:12:48.883')
]

# diff comes out in whole seconds; the milliseconds are missing
df = spark.createDataFrame(data, ['ID', 'max_ts', 'min_ts']) \
    .withColumn('diff', F.unix_timestamp('max_ts', format=timeFmt) - F.unix_timestamp('min_ts', format=timeFmt))
df.show(truncate=False)

Troy · Accepted Answer

Assuming you already have a dataframe with columns of timestamp type:

from datetime import datetime
import pyspark.sql.functions as F

data = [
    (1, datetime(2018, 7, 25, 17, 15, 6, 390000), datetime(2018, 7, 25, 17, 15, 6, 377000)),
    (2, datetime(2018, 7, 25, 11, 12, 49, 317000), datetime(2018, 7, 25, 11, 12, 48, 883000))
]

df = spark.createDataFrame(data, ['ID', 'max_ts','min_ts'])
df.printSchema()

# root
#  |-- ID: long (nullable = true)
#  |-- max_ts: timestamp (nullable = true)
#  |-- min_ts: timestamp (nullable = true)

You can get the time in seconds by casting the timestamp column to double, or in milliseconds by multiplying that result by 1000 (and optionally casting to long if you want an integer). For example:

df.select(
    F.col('max_ts').cast('double').alias('time_in_seconds'),
    (F.col('max_ts').cast('double') * 1000).cast('long').alias('time_in_milliseconds'),
).toPandas()

#     time_in_seconds  time_in_milliseconds
# 0    1532538906.390         1532538906390
# 1    1532517169.317         1532517169317

Finally, if you want the difference between the two times in milliseconds, you could do:

df.select(
    ((F.col('max_ts').cast('double') - F.col('min_ts').cast('double')) * 1000).cast('long').alias('diff_in_milliseconds'),
).toPandas()

#    diff_in_milliseconds
# 0                    13
# 1                   434
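As a quick cross-check of the expected values, the same subtraction can be sketched in plain Python on the driver, without Spark. This is only an illustration of the arithmetic; the helper name diff_in_milliseconds is made up for this sketch and is not part of PySpark.

```python
from datetime import datetime, timedelta

# Same sample rows as in the DataFrame above.
rows = [
    (1, datetime(2018, 7, 25, 17, 15, 6, 390000), datetime(2018, 7, 25, 17, 15, 6, 377000)),
    (2, datetime(2018, 7, 25, 11, 12, 49, 317000), datetime(2018, 7, 25, 11, 12, 48, 883000)),
]


def diff_in_milliseconds(later, earlier):
    # Hypothetical helper: dividing one timedelta by another with //
    # yields an int, so no floating-point rounding is involved.
    return (later - earlier) // timedelta(milliseconds=1)


diffs = {row_id: diff_in_milliseconds(max_ts, min_ts) for row_id, max_ts, min_ts in rows}
print(diffs)  # {1: 13, 2: 434}
```

The results match the diff_in_milliseconds column produced by the cast-to-double approach.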

I'm doing this on PySpark 2.4.2. There is no need to use string concatenation whatsoever.

Tanjin · Answer

That's the intended behavior of unix_timestamp: its docstring states that it returns seconds only, so the milliseconds component is dropped in the calculation.
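The effect can be reproduced in plain Python as a rough analogy. The sketch below is not Spark code: the helpers parse_utc and whole_seconds are hypothetical, and treating the strings as UTC is an assumption made only to get a fixed epoch value.

```python
from datetime import datetime, timezone


def parse_utc(text):
    # %f accepts one to six fractional digits, so '.39' is read as 390 ms.
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)  # assumption: values are UTC


def whole_seconds(text):
    # Mimics a seconds-only epoch value: the fractional part is discarded.
    return int(parse_utc(text).timestamp())


pairs = [
    ("2018-07-25 17:15:06.39", "2018-07-25 17:15:06.377"),
    ("2018-07-25 11:12:49.317", "2018-07-25 11:12:48.883"),
]
second_diffs = [whole_seconds(later) - whole_seconds(earlier) for later, earlier in pairs]
print(second_diffs)  # [0, 1]
```

Once both values are truncated to whole seconds, a 13 ms gap becomes 0 and a 434 ms gap becomes 1, which is the kind of result the question describes.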

If you want that calculation, you can use the substring function to extract the milliseconds, add them to the seconds, and then take the difference. See the example below. Note that this assumes well-formed data, i.e. the milliseconds part always has all three digits:

import pyspark.sql.functions as F

timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"
data = [
    (1, '2018-07-25 17:15:06.390', '2018-07-25 17:15:06.377'),  # note the '390'
    (2, '2018-07-25 11:12:49.317', '2018-07-25 11:12:48.883')
]

df = spark.createDataFrame(data, ['ID', 'max_ts', 'min_ts'])\
    .withColumn('max_milli', F.unix_timestamp('max_ts', format=timeFmt) + F.substring('max_ts', -3, 3).cast('float')/1000)\
    .withColumn('min_milli', F.unix_timestamp('min_ts', format=timeFmt) + F.substring('min_ts', -3, 3).cast('float')/1000)\
    .withColumn('diff', (F.col('max_milli') - F.col('min_milli')).cast('float') * 1000)

df.show(truncate=False)

+---+-----------------------+-----------------------+----------------+----------------+---------+
|ID |max_ts                 |min_ts                 |max_milli       |min_milli       |diff     |
+---+-----------------------+-----------------------+----------------+----------------+---------+
|1  |2018-07-25 17:15:06.390|2018-07-25 17:15:06.377|1.53255330639E9 |1.532553306377E9|13.000011|
|2  |2018-07-25 11:12:49.317|2018-07-25 11:12:48.883|1.532531569317E9|1.532531568883E9|434.0    |
+---+-----------------------+-----------------------+----------------+----------------+---------+

kaichi · Answer

The answer from Tanjin doesn't work when the values are of type timestamp and the milliseconds are round numbers (like 390 or 500). The trailing 0 is dropped when the timestamp is rendered as a string, so the timestamp from the example looks like 2018-07-25 17:15:06.39.

The problem is the hardcoded value in F.substring('max_ts', -3, 3). If the trailing 0 is missing, the substring picks up the wrong characters: for 17:15:06.39 it returns '.39' instead of '390'.

To convert tmpColumn (of type timestamp) to tmpLongColumn (of type long), I used this snippet:

import pyspark.sql.functions as F

timeFmt = "yyyy-MM-dd HH:mm:ss.SSS"

df = df \
  .withColumn('tmpLongColumn', F.substring_index('tmpColumn', '.', -1).cast('float')) \
  .withColumn('tmpLongColumn', F.when(F.col('tmpLongColumn') < 100, F.col('tmpLongColumn')*10).otherwise(F.col('tmpLongColumn')).cast('long')) \
  .withColumn('tmpLongColumn', (F.unix_timestamp('tmpColumn', format=timeFmt)*1000 + F.col('tmpLongColumn')))

The first transformation extracts the substring containing the milliseconds. Next, if the value is less than 100, it is multiplied by 10. Finally, the timestamp is converted to milliseconds and the extracted milliseconds are added.
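The multiply-by-10 step covers a single dropped trailing zero (such as .39). A more general way to think about the fix is to right-pad the fractional digits to three places before reading them as milliseconds. The plain-Python sketch below illustrates that idea only; fraction_to_millis is a hypothetical helper, not a PySpark function, and it assumes the only '.' in the string is the one before the fraction.

```python
def fraction_to_millis(ts_text):
    # Hypothetical helper: split off whatever follows the '.'.
    _, dot, fraction = ts_text.partition(".")
    if not dot:
        # No fractional part at all, e.g. '2018-07-25 17:15:06'.
        return 0
    # Right-pad to three digits ('39' -> '390', '5' -> '500') and
    # cut anything beyond millisecond precision ('390123' -> '390').
    return int(fraction.ljust(3, "0")[:3])


samples = [
    "2018-07-25 17:15:06.39",
    "2018-07-25 17:15:06.5",
    "2018-07-25 17:15:06.039",
    "2018-07-25 17:15:06",
]
print([fraction_to_millis(s) for s in samples])  # [390, 500, 39, 0]
```

Padding the digit string, rather than scaling the parsed number, keeps leading zeros meaningful: '.039' stays 39 ms while '.39' becomes 390 ms.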

Tags:

pyspark

Keerikkattu Chellappan
