
How do I truncate a PySpark dataframe of timestamp type to the day?

I have a PySpark dataframe that includes timestamps in a column (call the column 'dt'), like this:

2018-04-07 16:46:00
2018-03-06 22:18:00

When I execute:

SELECT trunc(dt, 'day') as day

...I expected:

2018-04-07 00:00:00
2018-03-06 00:00:00

But I got:

null
null

How do I truncate to the day instead of the hour?

Asked by Jared, Apr 20 '18 at 18:04




1 Answer

You are using the wrong function. trunc supports only a few formats and returns null for any other, which is why 'day' gives null. From its documentation:

Returns date truncated to the unit specified by the format.

:param format: 'year', 'yyyy', 'yy' or 'month', 'mon', 'mm'

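Use date_trunc instead. Its arguments are reversed relative to trunc (format first, then the column), so the query becomes SELECT date_trunc('day', dt) AS day. From its documentation: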

Returns timestamp truncated to the unit specified by the format.

:param format: 'year', 'yyyy', 'yy', 'month', 'mon', 'mm', 'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter'
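
To make the unit semantics concrete, here is a minimal plain-Python sketch of what truncating to a unit means: every field finer than the unit is reset to its lowest value. The truncate helper is hypothetical and covers only four units; it illustrates the documented behaviour and is not how Spark implements date_trunc.

```python
from datetime import datetime


def truncate(ts: datetime, unit: str) -> datetime:
    """Hypothetical helper (not a PySpark function): reset every field finer than `unit`."""
    if unit == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit in this sketch: {unit!r}")


# The first timestamp from the question, truncated to the day.
ts = datetime(2018, 4, 7, 16, 46, 0)
print(truncate(ts, "day"))  # 2018-04-07 00:00:00
```

The result is still a full timestamp with the time part zeroed, which matches the output the question expected.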

Example:

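from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_trunc

# Reuse the active session, or create one if none exists.
spark = SparkSession.builder.getOrCreate()

# Build a one-row DataFrame and cast the string column to a timestamp.
df = spark.createDataFrame(["2018-04-07 23:33:21"], "string").toDF("dt").select(col("dt").cast("timestamp"))

# Note the argument order: format first, then the column.
df.select(date_trunc("day", "dt")).show()
# +-------------------+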
# |date_trunc(day, dt)|
# +-------------------+
# |2018-04-07 00:00:00|
# +-------------------+
Answered by Alper t. Turker, Oct 05 '22 at 01:10