Note: I have this as a DataFrame in Spark. These Time/Date values constitute a single column in the DataFrame.
Input:
04-NOV-16 03.36.13.000000000 PM
06-NOV-15 03.42.21.000000000 PM
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM
Expected Output:
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM
06-NOV-15 03.42.21.000000000 PM
04-NOV-16 03.36.13.000000000 PM
You can use either the sort() or orderBy() method of a DataFrame to sort it in ascending or descending order based on one or more columns. On RDDs, sortBy() serves the same purpose, taking a function that extracts the sort key from each element. In Spark SQL, note the difference between the two sorting clauses: SORT BY only sorts the rows within each partition, so with more than one partition the result may be only partially ordered, whereas ORDER BY guarantees a total order of the output. The sketch below illustrates the difference.
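As a minimal Scala sketch (here df, rdd, and the registered view events with its ts column are just placeholders, not names from the question):

import org.apache.spark.sql.functions.col

// sort() and orderBy() are equivalent on a DataFrame
df.sort(col("timestampCol"))          // ascending by default
df.orderBy(col("timestampCol").desc)  // descending

// In SQL, ORDER BY guarantees a total order across the whole result;
// SORT BY only orders rows within each partition
spark.sql("SELECT * FROM events ORDER BY ts")
spark.sql("SELECT * FROM events SORT BY ts")

// On an RDD, sortBy() takes a function that extracts the sort key
rdd.sortBy(pair => pair._2)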
As this format is not standard, you need to use the unix_timestamp function to parse the string and convert it into a timestamp type:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF; `spark` is the active SparkSession

// Example data
val df = Seq(
  Tuple1("04-NOV-16 03.36.13.000000000 PM"),
  Tuple1("06-NOV-15 03.42.21.000000000 PM"),
  Tuple1("05-NOV-15 03.32.05.000000000 PM"),
  Tuple1("06-NOV-15 03.32.14.000000000 AM")
).toDF("stringCol")

// Timestamp pattern found in the string
val pattern = "dd-MMM-yy hh.mm.ss.S a"

// Create a new DataFrame with a parsed timestamp column and order by it
val newDF = df
  .withColumn("timestampCol", unix_timestamp(df("stringCol"), pattern).cast("timestamp"))
  .orderBy("timestampCol")

newDF.show(false)
Result:
+-------------------------------+---------------------+
|stringCol |timestampCol |
+-------------------------------+---------------------+
|05-NOV-15 03.32.05.000000000 PM|2015-11-05 15:32:05.0|
|06-NOV-15 03.32.14.000000000 AM|2015-11-06 03:32:14.0|
|06-NOV-15 03.42.21.000000000 PM|2015-11-06 15:42:21.0|
|04-NOV-16 03.36.13.000000000 PM|2016-11-04 15:36:13.0|
+-------------------------------+---------------------+
More about unix_timestamp and the other utility functions can be found in the documentation of the org.apache.spark.sql.functions object.
For building the timestamp pattern, one can refer to the java.text.SimpleDateFormat docs.
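For example, a quick standalone sanity check of the pattern above (a sketch; Locale.ENGLISH is passed explicitly so that "NOV" parses regardless of the JVM's default locale):

import java.text.SimpleDateFormat
import java.util.Locale

// dd = day of month, MMM = abbreviated month name, yy = 2-digit year,
// hh = 12-hour clock, mm/ss = minutes/seconds, S = milliseconds, a = AM/PM
val fmt = new SimpleDateFormat("dd-MMM-yy hh.mm.ss.S a", Locale.ENGLISH)
println(fmt.parse("04-NOV-16 03.36.13.000000000 PM"))
// prints the parsed java.util.Date, e.g. Fri Nov 04 15:36:13 ... 2016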
Edit 1: As noted by pheeleeppoo, you can order directly by the expression instead of creating a new column, assuming you want to keep only the string-typed column in your DataFrame:
val newDF = df.orderBy(unix_timestamp(df("stringCol"), pattern).cast("timestamp"))
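If a descending sort is wanted instead, the same expression can simply be reversed with desc:

val newDF = df.orderBy(unix_timestamp(df("stringCol"), pattern).cast("timestamp").desc)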
Edit 2: Please note that the precision of the unix_timestamp function is in seconds, so if the milliseconds really matter, a UDF can be used:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf

// UDF that parses the string with SimpleDateFormat and keeps the milliseconds
def myUDF(p: String) = udf(
  (value: String) => {
    val dateFormat = new SimpleDateFormat(p)
    val parsedDate = dateFormat.parse(value)
    new java.sql.Timestamp(parsedDate.getTime())
  }
)

val pattern = "dd-MMM-yy hh.mm.ss.S a"
val newDF = df.withColumn("timestampCol", myUDF(pattern)(df("stringCol"))).orderBy("timestampCol")
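As before, newDF.show(false) can be used to verify the ordering; timestampCol is an ordinary TimestampType column, so it supports all the usual comparisons and sorting afterwards.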