How to sort a column with Date and time values in Spark?

Note: I have this as a DataFrame in Spark. These date/time values constitute a single column in the DataFrame.

Input:

04-NOV-16 03.36.13.000000000 PM
06-NOV-15 03.42.21.000000000 PM
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM

Expected Output:

05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM
06-NOV-15 03.42.21.000000000 PM
04-NOV-16 03.36.13.000000000 PM
Asked Nov 17 '16 by Dasarathy D R

People also ask

How do I sort a date column in PySpark?

You can use either the sort() or orderBy() function of a PySpark DataFrame to sort it in ascending or descending order based on one or more columns; you can also sort using PySpark SQL sorting functions.
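For consistency with the Scala answer below, here is a minimal Scala sketch of the same idea (the DataFrame and its "value" column are made up for illustration; a spark-shell session is assumed so that toDF is available):

import org.apache.spark.sql.functions.col

// Example data with an invented column name.
val df = Seq(3, 1, 2).toDF("value")

df.sort("value").show()              // ascending by default
df.orderBy(col("value").desc).show() // explicit descending order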

How do you sort a data frame Spark?

You can use either the sort() or orderBy() built-in function to sort a DataFrame in ascending or descending order over one or more columns.
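To sort over more than one column, pass several column expressions; a hedged Scala sketch with invented column names:

import org.apache.spark.sql.functions.col

// Sort by "dept" ascending, then by "salary" descending (made-up columns).
val df = Seq(("a", 1, 100), ("b", 1, 200), ("c", 2, 50))
  .toDF("name", "dept", "salary")

df.orderBy(col("dept").asc, col("salary").desc).show()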

How does Spark sort by value?

sortBy() is used to sort data by value in PySpark. It is a method available on RDDs, and it takes a key function (a lambda expression) that selects the value to sort by.
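The blurb refers to PySpark, but the same method exists on Scala RDDs; a minimal sketch (assumes a spark-shell session where sc is the SparkContext):

// Sort (word, count) pairs by their count, descending; the lambda
// selects the sort key.
val rdd = sc.parallelize(Seq(("b", 2), ("a", 3), ("c", 1)))

rdd.sortBy(_._2, ascending = false).collect()
// Array((a,3), (b,2), (c,1))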

What is the difference between orderBy and sort by in Spark?

The SORT BY clause returns the result rows sorted within each partition, in the user-specified order. When there is more than one partition, SORT BY may return a result that is only partially ordered. This differs from the ORDER BY clause, which guarantees a total order of the output.
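In the DataFrame API, the same distinction shows up as orderBy versus sortWithinPartitions; a hedged sketch with made-up data:

// orderBy gives a total order across partitions; sortWithinPartitions
// (the DataFrame analogue of SORT BY) only sorts rows within each partition.
val df = Seq(5, 3, 4, 1, 2).toDF("n").repartition(2)

df.orderBy("n").show()               // globally sorted: 1..5
df.sortWithinPartitions("n").show()  // sorted per partition only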


1 Answer

As this format is not standard, you need to use the unix_timestamp function to parse the string and convert it into a timestamp type:

import org.apache.spark.sql.functions._

// Example data
val df = Seq(
  Tuple1("04-NOV-16 03.36.13.000000000 PM"),
  Tuple1("06-NOV-15 03.42.21.000000000 PM"),
  Tuple1("05-NOV-15 03.32.05.000000000 PM"),
  Tuple1("06-NOV-15 03.32.14.000000000 AM")
).toDF("stringCol")

// Timestamp pattern found in string
val pattern = "dd-MMM-yy hh.mm.ss.S a"

// Creating new DataFrame and ordering
val newDF = df
  .withColumn("timestampCol", unix_timestamp(df("stringCol"), pattern).cast("timestamp"))
  .orderBy("timestampCol")

newDF.show(false)

Result:

+-------------------------------+---------------------+
|stringCol                      |timestampCol         |
+-------------------------------+---------------------+
|05-NOV-15 03.32.05.000000000 PM|2015-11-05 15:32:05.0|
|06-NOV-15 03.32.14.000000000 AM|2015-11-06 03:32:14.0|
|06-NOV-15 03.42.21.000000000 PM|2015-11-06 15:42:21.0|
|04-NOV-16 03.36.13.000000000 PM|2016-11-04 15:36:13.0|
+-------------------------------+---------------------+

More about unix_timestamp and the other utility functions can be found in the org.apache.spark.sql.functions documentation.

For building the timestamp pattern, refer to the SimpleDateFormat docs.
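As a quick illustration, the pattern maps onto one of the question's values like this (a hedged sketch; Locale.ENGLISH is added here so the month abbreviation parses regardless of the JVM's default locale):

import java.text.SimpleDateFormat
import java.util.Locale

// dd -> 04, MMM -> NOV, yy -> 16, hh.mm.ss -> 03.36.13, S -> 000000000, a -> PM
val fmt = new SimpleDateFormat("dd-MMM-yy hh.mm.ss.S a", Locale.ENGLISH)

fmt.parse("04-NOV-16 03.36.13.000000000 PM") // java.util.Date for 2016-11-04 15:36:13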


Edit 1: As noted by pheeleeppoo, you can order directly by the expression instead of creating a new column, assuming you want to keep only the string-typed column in your dataframe:

val newDF = df.orderBy(unix_timestamp(df("stringCol"), pattern).cast("timestamp"))
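The expression is used only for sorting here; no new column is added to the output.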

Edit 2: Please note that the precision of the unix_timestamp function is in seconds, so if the milliseconds really matter, a UDF can be used:

import java.text.SimpleDateFormat

// Wraps SimpleDateFormat in a UDF so the full precision of the parsed
// date, including milliseconds, is kept in the resulting Timestamp.
def myUDF(p: String) = udf(
  (value: String) => {
    val dateFormat = new SimpleDateFormat(p)
    val parsedDate = dateFormat.parse(value)
    new java.sql.Timestamp(parsedDate.getTime())
  }
)

val pattern = "dd-MMM-yy hh.mm.ss.S a"
val newDF = df.withColumn("timestampCol", myUDF(pattern)(df("stringCol"))).orderBy("timestampCol")
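Note that SimpleDateFormat is not thread-safe, which is why a new instance is created inside the UDF body instead of being shared.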
Answered Oct 22 '22 by Daniel de Paula