I want to split a timestamp value into date and time, e.g.:
1/20/2016 3:20:30 PM
1/20/2016 3:20:31 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:33 PM
1/20/2016 3:20:34 PM
1/20/2016 3:20:34 PM
Each value needs to be split into a date part (1/20/2016) and a time part (3:20:30 PM).
Using the SQL split function, I am unable to process it correctly:
import pyspark.sql.functions

split_col = pyspark.sql.functions.split(df['ServerTime'], ' ')
df_date = df.withColumn('Date', split_col.getItem(0))
df_time = df.withColumn('Time', split_col.getItem(1))
Any help, guys?
As the date and time can come in any format, the right way to do this is to convert the date strings to a DateType() and then extract the Date and Time parts from it.
Let's take the sample data below:
server_times = sc.parallelize([('1/20/2016 3:20:30 PM',),
                               ('1/20/2016 3:20:31 PM',),
                               ('1/20/2016 3:20:32 PM',)]).toDF(['ServerTime'])
The Date and Time parts can then be extracted from it in any format you like:
from pyspark.sql.functions import unix_timestamp, from_unixtime, date_format

# Parse the string into epoch seconds, convert that back to a standard
# timestamp string, then format the Date and Time parts separately.
# Note: 'M' is month and 'm' is minute in the pattern, so the parse
# pattern must be 'M/d/yyyy h:mm:ss a'.
server_times.select(unix_timestamp(server_times.ServerTime, 'M/d/yyyy h:mm:ss a').alias('ut'))\
    .select(from_unixtime('ut').alias('dty'))\
    .select(date_format('dty', 'M/d/yyyy').alias('Date'),
            date_format('dty', 'h:mm:ss a').alias('Time'))\
    .show()
+---------+----------+
| Date| Time|
+---------+----------+
|1/20/2016|3:20:30 PM|
|1/20/2016|3:20:31 PM|
|1/20/2016|3:20:32 PM|
+---------+----------+
You can project these two into separate dataframes if you want.
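For instance, a minimal sketch of that projection, reusing the parsing above (the names parsed, dates, and times are just illustrative, not part of the original):

from pyspark.sql.functions import unix_timestamp, from_unixtime, date_format

# Parse once, then project each part into its own dataframe.
parsed = server_times.select(
    from_unixtime(unix_timestamp('ServerTime', 'M/d/yyyy h:mm:ss a')).alias('dty'))
dates = parsed.select(date_format('dty', 'M/d/yyyy').alias('Date'))
times = parsed.select(date_format('dty', 'h:mm:ss a').alias('Time'))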
You could use pyspark.sql.functions.concat to concatenate the relevant time bits together again. Let's first create some test data:
df = sc.parallelize([('1/20/2016 3:20:30 PM',),
                     ('1/20/2016 3:20:31 PM',),
                     ('1/20/2016 3:20:32 PM',)]).toDF(['ServerTime'])
You can do this:
import pyspark.sql.functions as F

# Splitting on ' ' yields three tokens (date, time, AM/PM marker),
# so the marker must be concatenated back onto the time part.
split_col = F.split(df['ServerTime'], ' ')
df_date = df.withColumn('Date', split_col.getItem(0))
df_time = df.withColumn('Time', F.concat(split_col.getItem(1), F.lit(' '), split_col.getItem(2)))
After running df_time.show(), the following output is returned:
+--------------------+----------+
| ServerTime| Time|
+--------------------+----------+
|1/20/2016 3:20:30 PM|3:20:30 PM|
|1/20/2016 3:20:31 PM|3:20:31 PM|
|1/20/2016 3:20:32 PM|3:20:32 PM|
+--------------------+----------+
Running df_date.show() returns:
+--------------------+---------+
| ServerTime| Date|
+--------------------+---------+
|1/20/2016 3:20:30 PM|1/20/2016|
|1/20/2016 3:20:31 PM|1/20/2016|
|1/20/2016 3:20:32 PM|1/20/2016|
+--------------------+---------+
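If you'd rather keep both parts on one dataframe instead of two, the same calls can simply be chained; a minimal sketch (df_both is an illustrative name, not from the original):

import pyspark.sql.functions as F

# Same split/concat logic, with Date and Time as two columns on a
# single dataframe.
split_col = F.split(df['ServerTime'], ' ')
df_both = df.withColumn('Date', split_col.getItem(0))\
            .withColumn('Time', F.concat(split_col.getItem(1), F.lit(' '), split_col.getItem(2)))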