Consider the following dataframe:
import spark.implicits._ // already in scope in spark-shell; needed for toDF elsewhere
case class ArrayElement(id: Long, value: Double)
val df = Seq(
  Seq(
    ArrayElement(1L, -2.0), ArrayElement(2L, 1.0), ArrayElement(0L, 0.0)
  )
).toDF("arr")
df.printSchema
root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = false)
 |    |    |-- value: double (nullable = false)
Is there a way to sort arr by value other than using a UDF?
I've seen org.apache.spark.sql.functions.sort_array. What is this method actually doing in the case of complex array elements? Is it sorting the array by the first element (i.e. id)?
Collection function: sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.
The Spark functions documentation says: "Sorts the input array for the given column in ascending order, according to the natural ordering of the array elements."
Before I explain, let's look at some examples of what sort_array does.
+----------------------------+----------------------------+
|arr                         |sorted                      |
+----------------------------+----------------------------+
|[[1,-2.0], [2,1.0], [0,0.0]]|[[0,0.0], [1,-2.0], [2,1.0]]|
+----------------------------+----------------------------+
+----------------------------+----------------------------+
|arr                         |sorted                      |
+----------------------------+----------------------------+
|[[0,-2.0], [2,1.0], [0,0.0]]|[[0,-2.0], [0,0.0], [2,1.0]]|
+----------------------------+----------------------------+
+-----------------------------+-----------------------------+
|arr                          |sorted                       |
+-----------------------------+-----------------------------+
|[[0,-2.0], [2,1.0], [-1,0.0]]|[[-1,0.0], [0,-2.0], [2,1.0]]|
+-----------------------------+-----------------------------+
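The post does not show the code behind these tables. A minimal sketch that should reproduce the first one is given here, assuming it is run in spark-shell or as a Scala script; the session setup, the app name and the name sortedDf are assumptions added for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sort_array

// Assumption: a local session; getOrCreate reuses the session that spark-shell already provides.
val spark = SparkSession.builder().master("local[*]").appName("sort-array-demo").getOrCreate()
import spark.implicits._

case class ArrayElement(id: Long, value: Double)

val df = Seq(
  Seq(ArrayElement(1L, -2.0), ArrayElement(2L, 1.0), ArrayElement(0L, 0.0))
).toDF("arr")

// With no second argument, sort_array sorts in ascending order.
val sortedDf = df.withColumn("sorted", sort_array($"arr"))
sortedDf.show(false)
```

The other two tables would come from the same call with different id values in the input.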
So sort_array compares the structs field by field: it orders them by the first field (id) and, when two elements tie on that field, falls back to the second field (value), and so on for any further fields.
I hope that's clear.
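Given that field-by-field behaviour, one possible way to sort by value without a UDF is to put value first in the struct, sort, and then restore the original field order. This is only a sketch: it assumes Spark 2.4 or later (for the SQL higher-order function transform), and the column name sorted_by_value and the session setup are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session; getOrCreate reuses the session that spark-shell already provides.
val spark = SparkSession.builder().master("local[*]").appName("sort-by-value-demo").getOrCreate()
import spark.implicits._

case class ArrayElement(id: Long, value: Double)

val df = Seq(
  Seq(ArrayElement(1L, -2.0), ArrayElement(2L, 1.0), ArrayElement(0L, 0.0))
).toDF("arr")

// Inner transform: rebuild each struct with `value` first, so it becomes the primary sort key.
// sort_array: compares the rebuilt structs field by field (value, then id).
// Outer transform: restore the original (id, value) field order.
val byValue = df.withColumn(
  "sorted_by_value",
  expr("""
    transform(
      sort_array(transform(arr, x -> named_struct('value', x.value, 'id', x.id))),
      x -> named_struct('id', x.id, 'value', x.value)
    )
  """)
)
byValue.show(false)
```

Elements with equal value end up ordered by id, since id is the second field of the temporary struct. On Spark 3.0 and later, the SQL function array_sort should also accept a comparator lambda, which may be a more direct alternative.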