Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract date from a string column containing timestamp in Pyspark

I have a dataframe which has a date in the following format:

+----------------------+
|date                  |
+----------------------+
|May 6, 2016 5:59:34 AM|
+----------------------+

I intend to extract the date from this in the format YYYY-MM-DD ; so the result should be for the above date - 2016-05-06.

But when I extract is using the following:

df.withColumn('part_date', from_unixtime(unix_timestamp(df.date, "MMM dd, YYYY hh:mm:ss aa"), "yyyy-MM-dd"))

I get the following date

2015-12-27

Can anyone please advise on this? I do not intend to convert my df to rdd to use datetime function from python and want to use this in the dataframe it self.

like image 548
learning Avatar asked May 19 '16 18:05

learning


1 Answers

There are some errors with your pattern. Here's a suggestion:

from_pattern = 'MMM d, yyyy h:mm:ss aa'
to_pattern = 'yyyy-MM-dd'
df.withColumn('part_date', from_unixtime(unix_timestamp(df['date'], from_pattern), to_pattern)).show()
+----------------------+----------+
|date                  |part_date |
+----------------------+----------+
|May 6, 2016 5:59:34 AM|2016-05-06|
+----------------------+----------+
like image 96
Daniel de Paula Avatar answered Nov 02 '22 23:11

Daniel de Paula