I am using PySpark through Spark 1.5.0. I have an unusual String format in rows of a column for datetime values. It looks like this:
Row[(datetime='2016_08_21 11_31_08')]
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss
format into a Timestamp? Something that can eventually come along the lines of
df = df.withColumn("date_time",df.datetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I need to replace _ with - in the date half and _ with : in the time part.
I was thinking I could split the column in two using substring and count backward from the end of the time. Then do the regexp_replace separately on each half, then concatenate. But this seems like too many operations. Is there an easier way?
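For reference, the string surgery described above (replace _ with - in the date half and _ with : in the time half) can be sketched in plain Python; a Spark version would do the same with two regexp_replace calls, but this illustrates the transformation itself:

```python
def normalize(dt_string):
    # Split on the single space between the date and time halves
    date_part, time_part = dt_string.split(" ")
    # Underscores become '-' in the date and ':' in the time
    return date_part.replace("_", "-") + " " + time_part.replace("_", ":")

print(normalize("2016_08_21 11_31_08"))  # -> 2016-08-21 11:31:08
```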
Spark >= 2.2
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))

## +-------------------+-------------------+
## |dt                 |parsed             |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
        # For Spark <= 1.5
        # See issues.apache.org/jira/browse/SPARK-11724
        .cast("double")
        .cast("timestamp"))
    .show(1, False))

## +-------------------+---------------------+
## |dt                 |parsed               |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
In both cases the format string should be compatible with Java SimpleDateFormat.
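As a sanity check on the pattern, the same format can be expressed with Python's strptime directives (yyyy -> %Y, MM -> %m, dd -> %d, HH -> %H, mm -> %M, ss -> %S). This is only an illustration of what the SimpleDateFormat pattern matches, not Spark code:

```python
from datetime import datetime

# "yyyy_MM_dd HH_mm_ss" in SimpleDateFormat terms corresponds to
# "%Y_%m_%d %H_%M_%S" in strptime terms
parsed = datetime.strptime("2016_08_21 11_31_08", "%Y_%m_%d %H_%M_%S")
print(parsed)  # -> 2016-08-21 11:31:08
```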