 

How can I split a timestamp column into date and time in Spark?

Tags:

pyspark

I want to split a timestamp value into date and time.

e.g.:

1/20/2016 3:20:30 PM
1/20/2016 3:20:31 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:33 PM
1/20/2016 3:20:34 PM
1/20/2016 3:20:34 PM

Each value needs to be split into a date (1/20/2016) and a time (3:20:30 PM).

Using the SQL split function, I am unable to process it correctly:

import pyspark.sql.functions

split_col = pyspark.sql.functions.split(df['ServerTime'], ' ')
df_date = df.withColumn('Date', split_col.getItem(0))
df_time = df.withColumn('Time', split_col.getItem(1))  # loses the trailing AM/PM

Any help, guys?

asked Mar 20 '17 by ben


2 Answers

As the date and time can come in any format, the right way to do this is to convert the strings to a timestamp and then extract the date and time parts from it.

Let's take the sample data below:

df = sc.parallelize([('1/20/2016 3:20:30 PM',),
                     ('1/20/2016 3:20:31 PM',),
                     ('1/20/2016 3:20:32 PM',)]).toDF(['ServerTime'])

The date and time parts can then be extracted, in whatever format you like, as follows:

from pyspark.sql.functions import unix_timestamp, from_unixtime, date_format

# Note the uppercase 'M' for month in the parse pattern ('m' means minute).
df.select(unix_timestamp(df.ServerTime, 'M/d/yyyy h:mm:ss a').alias('ut'))\
  .select(from_unixtime('ut').alias('dty'))\
  .select(date_format('dty', 'M/d/yyyy').alias('Date'),
          date_format('dty', 'h:mm:ss a').alias('Time'))\
  .show()

+---------+----------+
|     Date|      Time|
+---------+----------+
|1/20/2016|3:20:30 PM|
|1/20/2016|3:20:31 PM|
|1/20/2016|3:20:32 PM|
+---------+----------+

You can project these two into separate DataFrames if you want.
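
For instance, a minimal sketch (keeping the parsed timestamp in an intermediate dty column, as in the pipeline above):

from pyspark.sql.functions import unix_timestamp, from_unixtime, date_format

# Parse once, then select each part into its own DataFrame.
parsed = df.withColumn('dty', from_unixtime(
    unix_timestamp(df.ServerTime, 'M/d/yyyy h:mm:ss a')))

df_date = parsed.select(date_format('dty', 'M/d/yyyy').alias('Date'))
df_time = parsed.select(date_format('dty', 'h:mm:ss a').alias('Time'))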

answered by Rags

You could use pyspark.sql.functions.concat to concatenate the relevant time bits together again. Let's first create some test data:

df = sc.parallelize([('1/20/2016 3:20:30 PM',),
                     ('1/20/2016 3:20:31 PM',),
                     ('1/20/2016 3:20:32 PM',)]).toDF(['ServerTime'])    

You can do this:

import pyspark.sql.functions as F

# Splitting on ' ' yields three pieces: date, time, and the AM/PM marker.
split_col = F.split(df['ServerTime'], ' ')
df_date = df.withColumn('Date', split_col.getItem(0))
df_time = df.withColumn('Time', F.concat(split_col.getItem(1),
                                         F.lit(' '),
                                         split_col.getItem(2)))

After running df_time.show(), the following output is returned:

+--------------------+----------+
|          ServerTime|      Time|
+--------------------+----------+
|1/20/2016 3:20:30 PM|3:20:30 PM|
|1/20/2016 3:20:31 PM|3:20:31 PM|
|1/20/2016 3:20:32 PM|3:20:32 PM|
+--------------------+----------+

Running df_date.show() returns:

+--------------------+---------+
|          ServerTime|     Date|
+--------------------+---------+
|1/20/2016 3:20:30 PM|1/20/2016|
|1/20/2016 3:20:31 PM|1/20/2016|
|1/20/2016 3:20:32 PM|1/20/2016|
+--------------------+---------+
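
If you would rather have both new columns on a single DataFrame, you can chain the withColumn calls (a sketch reusing split_col from above):

df_both = (df
           .withColumn('Date', split_col.getItem(0))
           .withColumn('Time', F.concat(split_col.getItem(1),
                                        F.lit(' '),
                                        split_col.getItem(2))))
df_both.show()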
answered by Alex