How to preserve milliseconds when converting a date and time string to timestamp using PySpark?

I am trying to convert a column containing date and time as strings to timestamp, however I am losing the milliseconds part during the conversion.

Data

I have a Spark dataframe df that has a date and a time column containing strings. The time string contains milliseconds, as shown below:

+---------+------------+
|date     |time        |
+---------+------------+
|2018/1/2 |09:53:25.864|
|2018/1/3 |11:32:21.689|
|2018/1/4 |09:34:51.045|
+---------+------------+

What I tried

I concatenated date and time columns to get date_and_time column (string):

import pyspark.sql.functions as F

df = df.withColumn('date_and_time', F.concat_ws(' ', df.date, df.time))

df.show(3, False)

Output:

+--------+------------+---------------------+
|date    |time        |date_and_time        |
+--------+------------+---------------------+
|2018/1/2|09:53:25.864|2018/1/2 09:53:25.864|
|2018/1/3|11:32:21.689|2018/1/3 11:32:21.689|
|2018/1/4|09:34:51.045|2018/1/4 09:34:51.045|
+--------+------------+---------------------+

Then, I specified the timestamp format using SimpleDateFormat date and time patterns:

timestamp_format = 'yyyy/M/d HH:mm:ss.SSS'

Then, I tried to convert this string to a timestamp in a couple of different ways:

df.select(
    df.date_and_time,
    F.to_timestamp(df.date_and_time, timestamp_format).alias('method_1'),
    F.unix_timestamp(df.date_and_time, format=timestamp_format).cast('timestamp').alias('method_2')
).show(3, False)

As you can see below, the timestamp is missing the milliseconds part:

+---------------------+-------------------+-------------------+
|date_and_time        |method_1           |method_2           |
+---------------------+-------------------+-------------------+
|2018/1/2 09:53:25.864|2018-01-02 09:53:25|2018-01-02 09:53:25|
|2018/1/3 11:32:21.689|2018-01-03 11:32:21|2018-01-03 11:32:21|
|2018/1/4 09:34:51.045|2018-01-04 09:34:51|2018-01-04 09:34:51|
+---------------------+-------------------+-------------------+

How can I preserve the milliseconds when converting the string to timestamp?

I am using PySpark (Spark: 2.3.1, Python: 3.6.5).

I have looked at previously answered questions on SO and have not found a suitable solution.

Rahul asked Nov 14 '18 16:11


1 Answer

Even though this is an old post, I think it may be useful for people. The solution in https://stackoverflow.com/a/54340652/4383754 is probably the best approach and should scale well.

In case you're looking for a simpler solution and can accept the performance hit of a Python UDF, here is one:

from pyspark.sql.types import TimestampType
from pyspark.sql.functions import udf
from dateutil.parser import parse
data = [('2018/1/2', '09:53:25.864', '2018/1/2 09:53:25.864'),
        ('2018/1/3', '11:32:21.689', '2018/1/3 11:32:21.689'),
        ('2018/1/4', '09:34:51.045', '2018/1/4 09:34:51.045')]
df = spark.createDataFrame(
    data, 'date STRING, time STRING, date_and_time STRING')
parse_udf = udf(parse, TimestampType())
df = df.withColumn('parsed', parse_udf(df['date_and_time']))
df.show()
# +--------+------------+--------------------+--------------------+
# |    date|        time|       date_and_time|              parsed|
# +--------+------------+--------------------+--------------------+
# |2018/1/2|09:53:25.864|2018/1/2 09:53:25...|2018-01-02 09:53:...|
# |2018/1/3|11:32:21.689|2018/1/3 11:32:21...|2018-01-03 11:32:...|
# |2018/1/4|09:34:51.045|2018/1/4 09:34:51...|2018-01-04 09:34:...|
# +--------+------------+--------------------+--------------------+

df.dtypes
# [('date', 'string'),
#  ('time', 'string'),
#  ('date_and_time', 'string'),
#  ('parsed', 'timestamp')]

df[['parsed']].collect()[0][0]
# datetime.datetime(2018, 1, 2, 9, 53, 25, 864000) <- contains microsecond
Ankur answered Oct 13 '22 18:10