I am trying to get the difference between two timestamp columns, but the milliseconds are lost.
How can I fix this?
import pyspark.sql.functions as F
timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"
data = [
(1, '2018-07-25 17:15:06.39','2018-07-25 17:15:06.377'),
(2,'2018-07-25 11:12:49.317','2018-07-25 11:12:48.883')
]
df = spark.createDataFrame(data, ['ID', 'max_ts','min_ts']).withColumn('diff',F.unix_timestamp('max_ts', format=timeFmt) - F.unix_timestamp('min_ts', format=timeFmt))
df.show(truncate = False)
Assuming you already have a dataframe with columns of timestamp type:
from datetime import datetime
import pyspark.sql.functions as F
data = [
(1, datetime(2018, 7, 25, 17, 15, 6, 390000), datetime(2018, 7, 25, 17, 15, 6, 377000)),
(2, datetime(2018, 7, 25, 11, 12, 49, 317000), datetime(2018, 7, 25, 11, 12, 48, 883000))
]
df = spark.createDataFrame(data, ['ID', 'max_ts','min_ts'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- max_ts: timestamp (nullable = true)
# |-- min_ts: timestamp (nullable = true)
You can get the time in seconds by casting the timestamp-type column to a double
type, or in milliseconds by multiplying that result by 1000 (and optionally casting to long
if you want an integer).
For example:
df.select(
F.col('max_ts').cast('double').alias('time_in_seconds'),
(F.col('max_ts').cast('double') * 1000).cast('long').alias('time_in_milliseconds'),
).toPandas()
# time_in_seconds time_in_milliseconds
# 0 1532538906.390 1532538906390
# 1 1532517169.317 1532517169317
Finally, if you want the difference between the two times in milliseconds, you could do:
df.select(
((F.col('max_ts').cast('double') - F.col('min_ts').cast('double')) * 1000).cast('long').alias('diff_in_milliseconds'),
).toPandas()
# diff_in_milliseconds
# 0 13
# 1 434
I'm doing this on PySpark 2.4.2. There is no need to use string concatenation whatsoever.
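One caveat with the double-cast approach: a cast to long truncates, and doubles near 1.5e9 cannot represent every millisecond exactly, so a difference can land just below a whole number. The following is a plain-Python sketch of the same floating-point arithmetic, not Spark code. The helper name epoch_seconds and the choice to interpret the naive datetimes as UTC are assumptions made only for this illustration.

```python
from datetime import datetime, timezone


def epoch_seconds(dt: datetime) -> float:
    """Epoch seconds as a double. Treating dt as UTC is an assumption of this sketch."""
    return dt.replace(tzinfo=timezone.utc).timestamp()


# Two timestamps exactly 1 ms apart (hypothetical values, not from the question).
later = datetime(2018, 7, 25, 17, 15, 6, 1000)
earlier = datetime(2018, 7, 25, 17, 15, 6, 0)

# Same arithmetic as the answer: subtract the doubles, scale to milliseconds.
raw = (epoch_seconds(later) - epoch_seconds(earlier)) * 1000

truncated = int(raw)       # what a plain cast to an integer does
rounded = int(round(raw))  # round to the nearest millisecond first

print(raw, truncated, rounded)
```

Here the truncated value is 0 while the rounded value is 1. In PySpark, the analogous change would be to wrap the product in F.round(...) before the .cast('long').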
That's the intended behavior for unix_timestamp: its docstring in the source code clearly states that it only returns seconds, so the milliseconds component is dropped when doing the calculation.
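To see why the result in the question comes out as whole seconds, here is a small plain-Python analogy (not Spark itself). The helpers whole_seconds and diff_millis are hypothetical names used only for this sketch: the first mimics a seconds-only epoch value, the second keeps the fractional part.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"
EPOCH = datetime(1970, 1, 1)


def whole_seconds(ts: str) -> int:
    """Seconds since the epoch with the fractional part discarded."""
    return (datetime.strptime(ts, FMT) - EPOCH) // timedelta(seconds=1)


def diff_millis(later: str, earlier: str) -> int:
    """Difference between two timestamp strings in whole milliseconds."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta // timedelta(milliseconds=1)


rows = [
    ("2018-07-25 17:15:06.39", "2018-07-25 17:15:06.377"),
    ("2018-07-25 11:12:49.317", "2018-07-25 11:12:48.883"),
]

for max_ts, min_ts in rows:
    # Seconds-only subtraction loses the milliseconds; the second value keeps them.
    print(whole_seconds(max_ts) - whole_seconds(min_ts), diff_millis(max_ts, min_ts))
# 0 13
# 1 434
```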
If you want that calculation, you can use the substring function to extract the milliseconds, add them to the seconds, and then take the difference. See the example below. Please note that this assumes well-formed data, i.e. the milliseconds are always written out with all 3 digits:
import pyspark.sql.functions as F
timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"
data = [
(1, '2018-07-25 17:15:06.390', '2018-07-25 17:15:06.377'), # note the '390'
(2, '2018-07-25 11:12:49.317', '2018-07-25 11:12:48.883')
]
df = spark.createDataFrame(data, ['ID', 'max_ts', 'min_ts'])\
.withColumn('max_milli', F.unix_timestamp('max_ts', format=timeFmt) + F.substring('max_ts', -3, 3).cast('float')/1000)\
.withColumn('min_milli', F.unix_timestamp('min_ts', format=timeFmt) + F.substring('min_ts', -3, 3).cast('float')/1000)\
.withColumn('diff', (F.col('max_milli') - F.col('min_milli')).cast('float') * 1000)
df.show(truncate=False)
+---+-----------------------+-----------------------+----------------+----------------+---------+
|ID |max_ts |min_ts |max_milli |min_milli |diff |
+---+-----------------------+-----------------------+----------------+----------------+---------+
|1 |2018-07-25 17:15:06.390|2018-07-25 17:15:06.377|1.53255330639E9 |1.532553306377E9|13.000011|
|2 |2018-07-25 11:12:49.317|2018-07-25 11:12:48.883|1.532531569317E9|1.532531568883E9|434.0 |
+---+-----------------------+-----------------------+----------------+----------------+---------+
The answer from Tanjin doesn't work when the values are of type timestamp and the milliseconds end in a zero (like 390 or 500). The trailing 0 is dropped when the timestamp is converted to a string, so the timestamp from the example would look like this: 2018-07-25 17:15:06.39.
The problem is the hardcoded value in F.substring('max_ts', -3, 3). If the trailing 0 is missing, the substring picks up the wrong characters.
To convert tmpColumn of type timestamp to tmpLongColumn of type long, I used this snippet:
timeFmt = "yyyy-MM-dd HH:mm:ss.SSS"
df = df \
.withColumn('tmpLongColumn', F.substring_index('tmpColumn', '.', -1).cast('float')) \
.withColumn('tmpLongColumn', F.when(F.col('tmpLongColumn') < 100, F.col('tmpLongColumn')*10).otherwise(F.col('tmpLongColumn')).cast('long')) \
.withColumn('tmpLongColumn', (F.unix_timestamp('tmpColumn', format=timeFmt)*1000 + F.col('tmpLongColumn')))
The first transformation extracts the substring containing the milliseconds. Next, if the value is less than 100, it is multiplied by 10. Finally, the timestamp is converted to milliseconds and the extracted milliseconds are added.
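Note that the "multiply by 10 if below 100" rule only repairs a single dropped trailing zero. The plain-Python sketch below (not Spark code) compares that rule with right-padding the fractional digits to three places. Both function names are hypothetical and exist only for this illustration.

```python
def times_ten_rule(ts: str) -> int:
    """Mimics the rule above: take the digits after the dot, multiply by 10 if below 100."""
    value = float(ts.rsplit(".", 1)[-1])
    return int(value * 10 if value < 100 else value)


def fraction_to_millis(ts: str) -> int:
    """Right-pad the fractional digits to three places and read them as milliseconds."""
    _, dot, fraction = ts.partition(".")
    if not dot:
        return 0  # no fractional part at all
    return int(fraction.ljust(3, "0")[:3])


samples = [
    "2018-07-25 17:15:06.39",   # 390 ms, one trailing zero dropped
    "2018-07-25 17:15:06.5",    # 500 ms, two trailing zeros dropped
    "2018-07-25 17:15:06.039",  # 39 ms, leading zero is significant
]

for ts in samples:
    print(ts, times_ten_rule(ts), fraction_to_millis(ts))
# 2018-07-25 17:15:06.39 390 390
# 2018-07-25 17:15:06.5 50 500
# 2018-07-25 17:15:06.039 390 39
```

The padding variant gets all three cases right, while the times-ten rule is only correct for the first. In Spark, a similar right-pad could presumably be built with F.rpad, but that is untested here and the case with no fractional part would still need separate handling.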