Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert timestamp to date in Spark dataframe

I've seen (here: How to convert Timestamp to Date format in DataFrame?) the way to convert a timestamp in datetype, but,at least for me, it doesn't work.

Here is what I've tried:

# Create dataframe
df_test = spark.createDataFrame([('20170809',), ('20171007',)], ['date',])

# Convert to timestamp
df_test2 = df_test.withColumn('timestamp',func.when((df_test.date.isNull() | (df_test.date == '')) , '0')\
.otherwise(func.unix_timestamp(df_test.date,'yyyyMMdd')))\

# Convert timestamp to date again
df_test2.withColumn('date_again', df_test2['timestamp'].cast(stypes.DateType())).show()

But this returns null in the column date_again:

+--------+----------+----------+
|    date| timestamp|date_again|
+--------+----------+----------+
|20170809|1502229600|      null|
|20171007|1507327200|      null|
+--------+----------+----------+

Any idea of what's failing?

like image 699
Luis A.G. Avatar asked Aug 31 '17 09:08

Luis A.G.


3 Answers

Following:

func.when((df_test.date.isNull() | (df_test.date == '')) , '0')\
  .otherwise(func.unix_timestamp(df_test.date,'yyyyMMdd'))

doesn't work because it is type inconsistent - the first clause returns string while the second clause returns bigint. As a result it will always return NULL if data is NOT NULL and not empty.

It is also obsolete - SQL functions are NULL and malformed format safe. There is no need for additional checks.

In [1]: spark.sql("SELECT unix_timestamp(NULL, 'yyyyMMdd')").show()
+----------------------------------------------+
|unix_timestamp(CAST(NULL AS STRING), yyyyMMdd)|
+----------------------------------------------+
|                                          null|
+----------------------------------------------+


In [2]: spark.sql("SELECT unix_timestamp('', 'yyyyMMdd')").show()
+--------------------------+
|unix_timestamp(, yyyyMMdd)|
+--------------------------+
|                      null|
+--------------------------+

And you don't need intermediate step in Spark 2.2 or later:

from pyspark.sql.functions import to_date

to_date("date", "yyyyMMdd")
like image 176
Alper t. Turker Avatar answered Nov 07 '22 09:11

Alper t. Turker


you should be doing the following

>>> df_test2.withColumn('date_again', func.from_unixtime('timestamp').cast(DateType())).show()
+--------+----------+----------+
|    date| timestamp|date_again|
+--------+----------+----------+
|20170809|1502216100|2017-08-09|
|20171007|1507313700|2017-10-07|
+--------+----------+----------+

and schema is

>>> df_test2.withColumn('date_again', func.from_unixtime('timestamp').cast(DateType())).printSchema()
root
 |-- date: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- date_again: date (nullable = true)
like image 37
Ramesh Maharjan Avatar answered Nov 07 '22 09:11

Ramesh Maharjan


To convert a unix_timestamp column (called TIMESTMP) in a pyspark dataframe (df) -- to a Date type:

Below is a two step process (there may be a shorter way):

  • convert from UNIX timestamp to timestamp
  • convert from timestamp to Date

Initially the df.printShchema() shows: -- TIMESTMP: long (nullable = true)

use spark.SQL to implement the conversion as follows:

df.registerTempTable("dfTbl")

dfNew= spark.sql("""
                     SELECT *, cast(TIMESTMP as Timestamp) as newTIMESTMP 
                     FROM dfTbl d
                  """)

dfNew.printSchema()

the printSchema() will show:

-- newTIMESTMP: timestamp (nullable = true)

finally convert the type from timestamp to Date as follows:

from pyspark.sql.types import DateType
dfNew=dfNew.withColumn('actual_date', dfNew['newTIMESTMP'].cast(DateType()))
like image 8
Grant Shannon Avatar answered Nov 07 '22 09:11

Grant Shannon