How to sort a column with Date and time values in Spark?

Note: I have this as a DataFrame in Spark. These date/time values constitute a single column in the DataFrame.

Input:

04-NOV-16 03.36.13.000000000 PM
06-NOV-15 03.42.21.000000000 PM
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM

Expected Output:

05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM
06-NOV-15 03.42.21.000000000 PM
04-NOV-16 03.36.13.000000000 PM
Asked Nov 17 '16 by Dasarathy D R

People also ask

How do I sort a date column in PySpark?

You can use either the sort() or orderBy() function of a PySpark DataFrame to sort it in ascending or descending order based on one or more columns; you can also sort using PySpark SQL sorting functions.
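For consistency with the Scala answer below, here is a minimal Scala sketch of the same idea (the DataFrame and its "value" column are made up for illustration; a spark-shell session is assumed so that toDF is available):

import org.apache.spark.sql.functions.col

// Example data with an invented column name.
val df = Seq(3, 1, 2).toDF("value")

df.sort("value").show()              // ascending by default
df.orderBy(col("value").desc).show() // explicit descending order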

How do you sort a data frame Spark?

You can use either the sort() or orderBy() built-in function to sort a DataFrame in ascending or descending order over one or more columns.
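To sort over more than one column, pass several column expressions; a hedged Scala sketch with invented column names:

import org.apache.spark.sql.functions.col

// Sort by "dept" ascending, then by "salary" descending (made-up columns).
val df = Seq(("a", 1, 100), ("b", 1, 200), ("c", 2, 50))
  .toDF("name", "dept", "salary")

df.orderBy(col("dept").asc, col("salary").desc).show()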

How does Spark sort by value?

sortBy() is used to sort data by value in PySpark. It is a method available on RDDs, and it takes a key function (a lambda expression) that selects the value to sort by.
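The blurb refers to PySpark, but the same method exists on Scala RDDs; a minimal sketch (assumes a spark-shell session where sc is the SparkContext):

// Sort (word, count) pairs by their count, descending; the lambda
// selects the sort key.
val rdd = sc.parallelize(Seq(("b", 2), ("a", 3), ("c", 1)))

rdd.sortBy(_._2, ascending = false).collect()
// Array((a,3), (b,2), (c,1))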

What is the difference between orderBy and sort by in Spark?

The SORT BY clause returns the result rows sorted within each partition, in the user-specified order. When there is more than one partition, SORT BY may return a result that is only partially ordered. This differs from the ORDER BY clause, which guarantees a total order of the output.
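In the DataFrame API, the same distinction shows up as orderBy versus sortWithinPartitions; a hedged sketch with made-up data:

// orderBy gives a total order across partitions; sortWithinPartitions
// (the DataFrame analogue of SORT BY) only sorts rows within each partition.
val df = Seq(5, 3, 4, 1, 2).toDF("n").repartition(2)

df.orderBy("n").show()               // globally sorted: 1..5
df.sortWithinPartitions("n").show()  // sorted per partition only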


1 Answer

As this format is not standard, you need to use the unix_timestamp function to parse the string and convert it into a timestamp type:

import org.apache.spark.sql.functions._

// Example data
val df = Seq(
  Tuple1("04-NOV-16 03.36.13.000000000 PM"),
  Tuple1("06-NOV-15 03.42.21.000000000 PM"),
  Tuple1("05-NOV-15 03.32.05.000000000 PM"),
  Tuple1("06-NOV-15 03.32.14.000000000 AM")
).toDF("stringCol")

// Timestamp pattern found in string
val pattern = "dd-MMM-yy hh.mm.ss.S a"

// Creating new DataFrame and ordering
val newDF = df
  .withColumn("timestampCol", unix_timestamp(df("stringCol"), pattern).cast("timestamp"))
  .orderBy("timestampCol")

newDF.show(false)

Result:

+-------------------------------+---------------------+
|stringCol                      |timestampCol         |
+-------------------------------+---------------------+
|05-NOV-15 03.32.05.000000000 PM|2015-11-05 15:32:05.0|
|06-NOV-15 03.32.14.000000000 AM|2015-11-06 03:32:14.0|
|06-NOV-15 03.42.21.000000000 PM|2015-11-06 15:42:21.0|
|04-NOV-16 03.36.13.000000000 PM|2016-11-04 15:36:13.0|
+-------------------------------+---------------------+

More about unix_timestamp and the other utility functions can be found in the org.apache.spark.sql.functions documentation.

For building the timestamp pattern, refer to the SimpleDateFormat docs.
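As a quick illustration, the pattern maps onto one of the question's values like this (a hedged sketch; Locale.ENGLISH is added here so the month abbreviation parses regardless of the JVM's default locale):

import java.text.SimpleDateFormat
import java.util.Locale

// dd -> 04, MMM -> NOV, yy -> 16, hh.mm.ss -> 03.36.13, S -> 000000000, a -> PM
val fmt = new SimpleDateFormat("dd-MMM-yy hh.mm.ss.S a", Locale.ENGLISH)

fmt.parse("04-NOV-16 03.36.13.000000000 PM") // java.util.Date for 2016-11-04 15:36:13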


Edit 1: As noted by pheeleeppoo, you can order directly by the expression instead of creating a new column, assuming you want to keep only the string-typed column in your dataframe:

val newDF = df.orderBy(unix_timestamp(df("stringCol"), pattern).cast("timestamp"))
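The expression is used only for sorting here; no new column is added to the output.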

Edit 2: Please note that the precision of the unix_timestamp function is in seconds, so if the milliseconds really matter, a UDF can be used:

import java.text.SimpleDateFormat

// Wraps SimpleDateFormat in a UDF so the full precision of the parsed
// date, including milliseconds, is kept in the resulting Timestamp.
def myUDF(p: String) = udf(
  (value: String) => {
    val dateFormat = new SimpleDateFormat(p)
    val parsedDate = dateFormat.parse(value)
    new java.sql.Timestamp(parsedDate.getTime())
  }
)

val pattern = "dd-MMM-yy hh.mm.ss.S a"
val newDF = df.withColumn("timestampCol", myUDF(pattern)(df("stringCol"))).orderBy("timestampCol")
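Note that SimpleDateFormat is not thread-safe, which is why a new instance is created inside the UDF body instead of being shared.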
Answered Oct 22 '22 by Daniel de Paula