I have a PySpark dataframe with this schema:
root
|-- epoch: double (nullable = true)
|-- var1: double (nullable = true)
|-- var2: double (nullable = true)
where epoch is in seconds and should be converted to a datetime. In order to do so, I define a user-defined function (udf) as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import time

def epoch_to_datetime(x):
    return time.localtime(x)
    # return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))
    # return x * 0 + 1

epoch_to_datetime_udf = udf(epoch_to_datetime, DoubleType())
df.withColumn("datetime", epoch_to_datetime(df.epoch)).show()
I get this error:
---> 21 return time.localtime(x)
22 # return x * 0 + 1
23
TypeError: a float is required
If I simply return x + 1 in the function, it works. Trying float(x), float(str(x)) or numpy.float(x) in time.localtime(x) does not help and I still get the error. Outside of the udf, time.localtime(1.514687216E9) or other numbers works fine (see the snippet below). Using the datetime package to convert epoch to datetime results in similar errors. It seems that the time and datetime packages do not like to be fed DoubleType from PySpark. Any ideas how I can solve this issue? Thanks.
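For reference, here is the interactive check that works outside Spark (plain Python; the rendered string depends on the local time zone):

import time
time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(1.514687216e9))
# returns a formatted string for a date in late December 2017, depending on the time zone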
You don't need a udf function for that. All you need is to cast the double epoch column to TimestampType() and then use the date_format function, as below:
from pyspark.sql import functions as f
from pyspark.sql import types as t
df.withColumn('epoch', f.date_format(df.epoch.cast(dataType=t.TimestampType()), "yyyy-MM-dd"))
This will give you a string date:
root
|-- epoch: string (nullable = true)
|-- var1: double (nullable = true)
|-- var2: double (nullable = true)
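As a self-contained sketch of the same idea (the sample values, the SparkSession setup and the HH:mm:ss pattern are illustrative, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql import types as t

spark = SparkSession.builder.getOrCreate()

# Illustrative data: epoch seconds stored as double, matching the question's schema
df = spark.createDataFrame([(1.514687216e9, 1.0, 2.0)], ['epoch', 'var1', 'var2'])

# Cast the double seconds to a timestamp, then render it as a formatted string
df = df.withColumn('datetime',
                   f.date_format(df.epoch.cast(t.TimestampType()),
                                 'yyyy-MM-dd HH:mm:ss'))
df.show(truncate=False)  # the rendered time depends on the session time zone

Writing the result into a separate datetime column, as here, leaves the original epoch double untouched.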
And you can use the to_date function as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t
df.withColumn('epoch', f.to_date(df.epoch.cast(dataType=t.TimestampType())))
which would give you date as the data type of the epoch column:
root
|-- epoch: date (nullable = true)
|-- var1: double (nullable = true)
|-- var2: double (nullable = true)
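If you do want to stay with a udf for some reason, a minimal sketch that avoids the TypeError (assuming a string result is acceptable) is to return the formatted string and to call the registered udf rather than the plain Python function:

import time
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def epoch_to_datetime(x):
    # inside the udf, x arrives as a plain Python float
    return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))

# the declared return type must match what the function returns (a string, not a double)
epoch_to_datetime_udf = udf(epoch_to_datetime, StringType())

df.withColumn('datetime', epoch_to_datetime_udf(df.epoch)).show()

The built-in cast/date_format route above is still preferable, since a Python udf forces row-by-row serialization.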
I hope the answer is helpful.
For me, I needed to convert a long timestamp (in milliseconds) back to a date format. I used @Glicth's comment, which worked for me; it might help others.
from pyspark.sql import functions as f

df001 = spark.createDataFrame([(1639518261056,), (1639518260824,)], ['timestamp_long'])
# the long values are epoch milliseconds, so divide by 1000 before converting
df002 = df001.withColumn("timestamp", f.to_timestamp(df001['timestamp_long'] / 1000))
df001.printSchema()
display(df002)
Schema of df001:
root
|-- timestamp_long: long (nullable = true)
Using Databricks: output of display(df002)
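A small follow-up sketch for anyone not on Databricks (df003 and the date column name are just illustrative):

# show() works anywhere, display() is Databricks-specific
df002.printSchema()
df002.show(truncate=False)

# if only the calendar date is needed, to_date can be applied on top of the timestamp
df003 = df002.withColumn('date', f.to_date(df002['timestamp']))
df003.printSchema()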