I am using PySpark through Spark 1.5.0. I have an unusual String format in rows of a column for datetime values. It looks like this:
Row[(datetime='2016_08_21 11_31_08')]
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss
format into a Timestamp? Something that can eventually come along the lines of
df = df.withColumn("date_time",df.datetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I need to replace _ with - in the date half and _ with : in the time part.
I was thinking I could split the column in two using substring and count backward from the end of the time. Then do the regexp_replace separately on each half, then concatenate. But this seems like too many operations. Is there an easier way?
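For reference, the string surgery described above (replace _ with - in the date half and _ with : in the time half) can be sketched in plain Python; a Spark version would do the same with two regexp_replace calls, but this illustrates the transformation itself:

```python
def normalize(dt_string):
    # Split on the single space between the date and time halves
    date_part, time_part = dt_string.split(" ")
    # Underscores become '-' in the date and ':' in the time
    return date_part.replace("_", "-") + " " + time_part.replace("_", ":")

print(normalize("2016_08_21 11_31_08"))  # -> 2016-08-21 11:31:08
```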
Spark >= 2.2
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))

## +-------------------+-------------------+
## |dt                 |parsed             |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
        # For Spark <= 1.5
        # See issues.apache.org/jira/browse/SPARK-11724
        .cast("double")
        .cast("timestamp"))
    .show(1, False))

## +-------------------+---------------------+
## |dt                 |parsed               |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
In both cases the format string should be compatible with Java SimpleDateFormat.
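As a sanity check on the pattern, the same format can be expressed with Python's strptime directives (yyyy -> %Y, MM -> %m, dd -> %d, HH -> %H, mm -> %M, ss -> %S). This is only an illustration of what the SimpleDateFormat pattern matches, not Spark code:

```python
from datetime import datetime

# "yyyy_MM_dd HH_mm_ss" in SimpleDateFormat terms corresponds to
# "%Y_%m_%d %H_%M_%S" in strptime terms
parsed = datetime.strptime("2016_08_21 11_31_08", "%Y_%m_%d %H_%M_%S")
print(parsed)  # -> 2016-08-21 11:31:08
```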