
Sort Array of structs in Spark DataFrame

Consider the following dataframe:

import spark.implicits._ // already in scope in spark-shell; needed for toDF elsewhere

case class ArrayElement(id: Long, value: Double)

val df = Seq(
  Seq(
    ArrayElement(1L, -2.0), ArrayElement(2L, 1.0), ArrayElement(0L, 0.0)
  )
).toDF("arr")

df.printSchema

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = false)
 |    |    |-- value: double (nullable = false)

Is there a way to sort arr by value without using a UDF?

I've seen org.apache.spark.sql.functions.sort_array. What does this method actually do when the array elements are complex, such as structs? Does it sort the array by the first field (i.e. id)?

Raphael Roth asked Nov 27 '17 09:11




1 Answer

The documentation for org.apache.spark.sql.functions.sort_array says: "Sorts the input array for the given column in ascending order, according to the natural ordering of the array elements."

Before I explain, let's look at some examples of what sort_array does.

+----------------------------+----------------------------+
|arr                         |sorted                      |
+----------------------------+----------------------------+
|[[1,-2.0], [2,1.0], [0,0.0]]|[[0,0.0], [1,-2.0], [2,1.0]]|
+----------------------------+----------------------------+

+----------------------------+----------------------------+
|arr                         |sorted                      |
+----------------------------+----------------------------+
|[[0,-2.0], [2,1.0], [0,0.0]]|[[0,-2.0], [0,0.0], [2,1.0]]|
+----------------------------+----------------------------+

+-----------------------------+-----------------------------+
|arr                          |sorted                       |
+-----------------------------+-----------------------------+
|[[0,-2.0], [2,1.0], [-1,0.0]]|[[-1,0.0], [0,-2.0], [2,1.0]]|
+-----------------------------+-----------------------------+

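So sort_array compares struct elements field by field: it orders them by the first field (id here), uses the second field (value) only to break ties, and so on through the remaining fields. The second example shows the tie-break: both elements with id 0 are ordered by their value.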
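Because the first field is compared first, one way to sort by value without a UDF is to rebuild each struct with value in the leading position and then call sort_array. The following is only a sketch under some assumptions: it relies on the SQL higher-order function transform, which needs Spark 2.4 or later (newer than the Spark version current when the question was asked), and the names SortStructArray, withSortedColumns and byValue are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr, sort_array}

case class ArrayElement(id: Long, value: Double)

object SortStructArray {
  // Hypothetical helper: adds `sorted` (plain sort_array, so id is compared first)
  // and `byValue` (structs rebuilt with value first, so value is compared first).
  def withSortedColumns(df: DataFrame): DataFrame =
    df
      .withColumn("sorted", sort_array(col("arr")))
      .withColumn(
        "byValue",
        // transform is a Spark SQL higher-order function (assumes Spark 2.4+)
        sort_array(expr("transform(arr, x -> named_struct('value', x.value, 'id', x.id))"))
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sort-struct-array")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Seq(ArrayElement(1L, -2.0), ArrayElement(2L, 1.0), ArrayElement(0L, 0.0))
    ).toDF("arr")

    // Element order in byValue: value -2.0 (id 1), then 0.0 (id 0), then 1.0 (id 2)
    withSortedColumns(df).show(false)

    spark.stop()
  }
}
```

Note that the structs in byValue have their fields in the order (value, id). If the original field order matters, a second transform can swap them back after sorting.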

I hope that's clear.

Ramesh Maharjan answered Sep 20 '22 03:09
