 

How can I split a timestamp column into date and time in Spark?

Tags:

pyspark

I want to split a timestamp value into date and time.

e.g.:

1/20/2016 3:20:30 PM
1/20/2016 3:20:31 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:33 PM
1/20/2016 3:20:34 PM
1/20/2016 3:20:34 PM

Each value needs to be split into a date (1/20/2016) and a time (3:20:30 PM).

Using the SQL split function, I am unable to process it correctly:

import pyspark.sql.functions

split_col = pyspark.sql.functions.split(df['ServerTime'], ' ')
df_date = df.withColumn('Date', split_col.getItem(0))
df_time = df.withColumn('Time', split_col.getItem(1))  # loses the trailing AM/PM

Any help, guys?

asked Mar 20 '17 by ben


2 Answers

As the date and time can come in any format, the right way to do this is to convert the strings to a timestamp and then extract the date and time parts from it.

Let's take the sample data below:

df = sc.parallelize([('1/20/2016 3:20:30 PM',),
                     ('1/20/2016 3:20:31 PM',),
                     ('1/20/2016 3:20:32 PM',)]).toDF(['ServerTime'])

The date and time parts can then be extracted, in whatever format you like, as follows:

from pyspark.sql.functions import unix_timestamp, from_unixtime, date_format

# Note the uppercase 'M' for month in the parse pattern ('m' means minute).
df.select(unix_timestamp(df.ServerTime, 'M/d/yyyy h:mm:ss a').alias('ut'))\
  .select(from_unixtime('ut').alias('dty'))\
  .select(date_format('dty', 'M/d/yyyy').alias('Date'),
          date_format('dty', 'h:mm:ss a').alias('Time'))\
  .show()

+---------+----------+
|     Date|      Time|
+---------+----------+
|1/20/2016|3:20:30 PM|
|1/20/2016|3:20:31 PM|
|1/20/2016|3:20:32 PM|
+---------+----------+

You can project these two into separate DataFrames if you want.
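
For instance, a minimal sketch (keeping the parsed timestamp in an intermediate dty column, as in the pipeline above):

from pyspark.sql.functions import unix_timestamp, from_unixtime, date_format

# Parse once, then select each part into its own DataFrame.
parsed = df.withColumn('dty', from_unixtime(
    unix_timestamp(df.ServerTime, 'M/d/yyyy h:mm:ss a')))

df_date = parsed.select(date_format('dty', 'M/d/yyyy').alias('Date'))
df_time = parsed.select(date_format('dty', 'h:mm:ss a').alias('Time'))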

answered by Rags

You could use pyspark.sql.functions.concat to concatenate the relevant time bits together again. Let's first create some test data:

df = sc.parallelize([('1/20/2016 3:20:30 PM',),
                     ('1/20/2016 3:20:31 PM',),
                     ('1/20/2016 3:20:32 PM',)]).toDF(['ServerTime'])    

You can do this:

import pyspark.sql.functions as F

# Splitting on ' ' yields three pieces: date, time, and the AM/PM marker.
split_col = F.split(df['ServerTime'], ' ')
df_date = df.withColumn('Date', split_col.getItem(0))
df_time = df.withColumn('Time', F.concat(split_col.getItem(1),
                                         F.lit(' '),
                                         split_col.getItem(2)))

After running df_time.show(), the following output is returned:

+--------------------+----------+
|          ServerTime|      Time|
+--------------------+----------+
|1/20/2016 3:20:30 PM|3:20:30 PM|
|1/20/2016 3:20:31 PM|3:20:31 PM|
|1/20/2016 3:20:32 PM|3:20:32 PM|
+--------------------+----------+

Running df_date.show() returns:

+--------------------+---------+
|          ServerTime|     Date|
+--------------------+---------+
|1/20/2016 3:20:30 PM|1/20/2016|
|1/20/2016 3:20:31 PM|1/20/2016|
|1/20/2016 3:20:32 PM|1/20/2016|
+--------------------+---------+
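
If you would rather have both new columns on a single DataFrame, you can chain the withColumn calls (a sketch reusing split_col from above):

df_both = (df
           .withColumn('Date', split_col.getItem(0))
           .withColumn('Time', F.concat(split_col.getItem(1),
                                        F.lit(' '),
                                        split_col.getItem(2))))
df_both.show()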
answered by Alex