
PySpark dataframe convert unusual string format to Timestamp

I am using PySpark through Spark 1.5.0. One of my columns holds datetime values in an unusual string format. It looks like this:

Row(datetime='2016_08_21 11_31_08')

Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp? Ideally something along the lines of:

df = df.withColumn("date_time", df.datetime.astype('Timestamp'))

I had thought that Spark SQL functions like regexp_replace could work, but of course I would need to replace _ with - in the date half and _ with : in the time part.

I was thinking I could split the column in two using substring, counting backward from the end of the string. Then I could do the regexp_replace on each half separately and concatenate the results. But this seems like too many operations. Is there an easier way?
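For reference, here is a minimal sketch of that split/replace/concatenate route (the column name datetime comes from the question; the substring positions assume the fixed-width format shown above):

from pyspark.sql import functions as F

# Rebuild a standard timestamp string from the underscore format.
# Date half: chars 1-10, '_' -> '-'; time half: chars 12-19, '_' -> ':'.
date_part = F.regexp_replace(F.substring("datetime", 1, 10), "_", "-")
time_part = F.regexp_replace(F.substring("datetime", 12, 8), "_", ":")
df = df.withColumn("date_time",
                   F.concat_ws(" ", date_part, time_part).cast("timestamp"))

As the answer below shows, though, a single call with a custom date pattern avoids all of this.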

asked Aug 22 '16 by PR102012

People also ask

How do I change the format of a timestamp in PySpark?

The to_date() function in Apache PySpark is commonly used to convert a Timestamp to a date. It does this by truncating the time part of the Timestamp column. to_date() takes a Timestamp as its input, by default in the format "yyyy-MM-dd HH:mm:ss.SSS".
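For example, a minimal sketch (the column name ts is an assumption):

from pyspark.sql.functions import to_date

# Truncate the time part of a TimestampType column, keeping only the date.
df = df.withColumn("just_date", to_date("ts"))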

How do you convert a string to a date in PySpark?

PySpark's to_date() function converts a string (StringType) column to a date (DateType) column. It takes the date string as its first argument and, as its second argument, the pattern that the date string is in.
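A minimal sketch (the column name dt_str and the pattern are assumptions; note that the format argument to to_date() requires Spark 2.2+):

from pyspark.sql.functions import to_date

# Parse a StringType column into DateType with an explicit pattern.
df = df.withColumn("parsed_date", to_date("dt_str", "yyyy_MM_dd"))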

How do I change date format in PySpark?

In PySpark, use the date_format() function to convert a DataFrame column from Date to String format.
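A minimal sketch (the column name d and the output pattern are assumptions):

from pyspark.sql.functions import date_format

# Render a Date/Timestamp column as a formatted string.
df = df.withColumn("d_str", date_format("d", "MM/dd/yyyy"))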


1 Answer

Spark >= 2.2

from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))

## +-------------------+-------------------+
## |dt                 |parsed             |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+

Spark < 2.2

It is nothing that unix_timestamp cannot handle:

from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
        # For Spark <= 1.5
        # See issues.apache.org/jira/browse/SPARK-11724
        .cast("double")
        .cast("timestamp"))
    .show(1, False))

## +-------------------+---------------------+
## |dt                 |parsed               |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+

In both cases the format string should be compatible with Java SimpleDateFormat.

answered Sep 23 '22 by zero323