I am trying to convert a column containing date and time strings to a timestamp; however, I am losing the milliseconds during the conversion.
I have a Spark DataFrame df with a date and a time column, both containing strings. The time string includes milliseconds, as shown below:
+---------+------------+
|date |time |
+---------+------------+
|2018/1/2 |09:53:25.864|
|2018/1/3 |11:32:21.689|
|2018/1/4 |09:34:51.045|
+---------+------------+
I concatenated the date and time columns to get a date_and_time column (string):
import pyspark.sql.functions as F
df = df.withColumn('date_and_time', F.concat_ws(' ', df.date, df.time))
df.show(3, False)
Output:
+--------+------------+---------------------+
|date |time |date_and_time |
+--------+------------+---------------------+
|2018/1/2|09:53:25.864|2018/1/2 09:53:25.864|
|2018/1/3|11:32:21.689|2018/1/3 11:32:21.689|
|2018/1/4|09:34:51.045|2018/1/4 09:34:51.045|
+--------+------------+---------------------+
Then, I specified the timestamp format using the SimpleDateFormat date and time patterns:
timestamp_format = 'yyyy/M/d HH:mm:ss.SSS'
Then, I tried to convert this string to timestamp using a couple of different ways:
df.select(
    df.date_and_time,
    F.to_timestamp(df.date_and_time, timestamp_format).alias('method_1'),
    F.unix_timestamp(df.date_and_time, format=timestamp_format).cast('timestamp').alias('method_2')
).show(3, False)
As you can see below, the timestamp is missing the milliseconds part:
+---------------------+-------------------+-------------------+
|date_and_time |method_1 |method_2 |
+---------------------+-------------------+-------------------+
|2018/1/2 09:53:25.864|2018-01-02 09:53:25|2018-01-02 09:53:25|
|2018/1/3 11:32:21.689|2018-01-03 11:32:21|2018-01-03 11:32:21|
|2018/1/4 09:34:51.045|2018-01-04 09:34:51|2018-01-04 09:34:51|
+---------------------+-------------------+-------------------+
How can I preserve the milliseconds when converting the string to timestamp?
I am using PySpark (Spark: 2.3.1, Python: 3.6.5).
I have looked at previously answered questions on SO and have not found a suitable solution.
Even though this is an old post, I think it may be useful for people. The solution in https://stackoverflow.com/a/54340652/4383754 is probably the best way, and it should scale well.
In case you're looking for a simpler solution and can accept the performance hit of a Python UDF, here is one:
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import udf
from dateutil.parser import parse
data = [('2018/1/2', '09:53:25.864', '2018/1/2 09:53:25.864'),
        ('2018/1/3', '11:32:21.689', '2018/1/3 11:32:21.689'),
        ('2018/1/4', '09:34:51.045', '2018/1/4 09:34:51.045')]
df = spark.createDataFrame(
    data, 'date STRING, time STRING, date_and_time STRING')
parse_udf = udf(parse, TimestampType())
df = df.withColumn('parsed', parse_udf(df['date_and_time']))
df.show()
# +--------+------------+--------------------+--------------------+
# | date| time| date_and_time| parsed|
# +--------+------------+--------------------+--------------------+
# |2018/1/2|09:53:25.864|2018/1/2 09:53:25...|2018-01-02 09:53:...|
# |2018/1/3|11:32:21.689|2018/1/3 11:32:21...|2018-01-03 11:32:...|
# |2018/1/4|09:34:51.045|2018/1/4 09:34:51...|2018-01-04 09:34:...|
# +--------+------------+--------------------+--------------------+
df.dtypes
# [('date', 'string'),
# ('time', 'string'),
# ('date_and_time', 'string'),
# ('parsed', 'timestamp')]
df[['parsed']].collect()[0][0]
# datetime.datetime(2018, 1, 2, 9, 53, 25, 864000) <- contains microsecond