How to preserve milliseconds when converting a date and time string to timestamp using PySpark?

I am trying to convert a column containing date and time as strings to timestamp, however I am losing the milliseconds part during the conversion.

Data

I have a Spark dataframe df that has a date and a time column containing strings. The time string contains milliseconds, as shown below:

+---------+------------+
|date     |time        |
+---------+------------+
|2018/1/2 |09:53:25.864|
|2018/1/3 |11:32:21.689|
|2018/1/4 |09:34:51.045|
+---------+------------+

What I tried

I concatenated date and time columns to get date_and_time column (string):

import pyspark.sql.functions as F

df = df.withColumn('date_and_time', F.concat_ws(' ', df.date, df.time))

df.show(3, False)

Output:

+--------+------------+---------------------+
|date    |time        |date_and_time        |
+--------+------------+---------------------+
|2018/1/2|09:53:25.864|2018/1/2 09:53:25.864|
|2018/1/3|11:32:21.689|2018/1/3 11:32:21.689|
|2018/1/4|09:34:51.045|2018/1/4 09:34:51.045|
+--------+------------+---------------------+

Then, I specified the timestamp format using SimpleDateFormat date and time patterns:

timestamp_format = 'yyyy/M/d HH:mm:ss.SSS'

Then, I tried to convert this string to a timestamp in a couple of different ways:

df.select(
    df.date_and_time,
    F.to_timestamp(df.date_and_time, timestamp_format).alias('method_1'),
    F.unix_timestamp(df.date_and_time, format=timestamp_format).cast('timestamp').alias('method_2')
).show(3, False)

As you can see below, the timestamp is missing the milliseconds part:

+---------------------+-------------------+-------------------+
|date_and_time        |method_1           |method_2           |
+---------------------+-------------------+-------------------+
|2018/1/2 09:53:25.864|2018-01-02 09:53:25|2018-01-02 09:53:25|
|2018/1/3 11:32:21.689|2018-01-03 11:32:21|2018-01-03 11:32:21|
|2018/1/4 09:34:51.045|2018-01-04 09:34:51|2018-01-04 09:34:51|
+---------------------+-------------------+-------------------+

How can I preserve the milliseconds when converting the string to timestamp?

I am using PySpark (Spark: 2.3.1, Python: 3.6.5).

I have looked at previously answered questions on SO and have not found a suitable solution.

Rahul asked Nov 14 '18 16:11


1 Answer

Even though this is an old post, I think it may be useful for people. The solution in https://stackoverflow.com/a/54340652/4383754 is probably the best approach and should scale well.

In case you're looking for a simpler solution and can accept the performance hit of a Python UDF, here is one:

from pyspark.sql.types import TimestampType
from pyspark.sql.functions import udf
from dateutil.parser import parse
data = [('2018/1/2', '09:53:25.864', '2018/1/2 09:53:25.864'),
        ('2018/1/3', '11:32:21.689', '2018/1/3 11:32:21.689'),
        ('2018/1/4', '09:34:51.045', '2018/1/4 09:34:51.045')]
df = spark.createDataFrame(
    data, 'date STRING, time STRING, date_and_time STRING')
parse_udf = udf(parse, TimestampType())
df = df.withColumn('parsed', parse_udf(df['date_and_time']))
df.show()
# +--------+------------+--------------------+--------------------+
# |    date|        time|       date_and_time|              parsed|
# +--------+------------+--------------------+--------------------+
# |2018/1/2|09:53:25.864|2018/1/2 09:53:25...|2018-01-02 09:53:...|
# |2018/1/3|11:32:21.689|2018/1/3 11:32:21...|2018-01-03 11:32:...|
# |2018/1/4|09:34:51.045|2018/1/4 09:34:51...|2018-01-04 09:34:...|
# +--------+------------+--------------------+--------------------+

df.dtypes
# [('date', 'string'),
#  ('time', 'string'),
#  ('date_and_time', 'string'),
#  ('parsed', 'timestamp')]

df[['parsed']].collect()[0][0]
# datetime.datetime(2018, 1, 2, 9, 53, 25, 864000) <- contains microsecond
Ankur answered Oct 13 '22 18:10