How to use orderBy() with descending order in Spark window functions?

I need a window function that partitions by some keys (i.e., column names), orders by another column, and returns the rows with the top x ranks.

This works fine for ascending order:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
  // split the comma-separated key list into individual column names
  val top_keys: List[String] = top_key.split(",").map(_.trim).toList
  // partition by all keys (head plus the rest), order ascending by the value column
  val w = Window.partitionBy(top_keys.head, top_keys.tail: _*)
    .orderBy(top_value)
  val rankCondition = "rn < " + top_x
  df.withColumn("rn", row_number().over(w))
    .where(rankCondition)
    .drop("rn")
}

But when I try to change the .orderBy(top_value) call to orderBy(desc(top_value)) or orderBy(top_value.desc), I get a syntax error. What's the correct syntax here?

asked Jul 25 '16 by Malte

People also ask

How do I order descending order from Spark?

To sort a Spark DataFrame in descending order, we can use the desc method of the Column class or the desc() SQL function.
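For instance, a minimal Scala sketch (df and its price column are assumed names):

import org.apache.spark.sql.functions.desc

df.orderBy(df("price").desc).show()   // desc method on the Column class
df.orderBy(desc("price")).show()      // desc() SQL function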

How do you sort a DataFrame in descending order in Spark?

In Spark, we can use either the sort() or the orderBy() function of DataFrame/Dataset to sort in ascending or descending order based on one or multiple columns. We can also control null placement with the Spark SQL sorting functions asc_nulls_first(), asc_nulls_last(), desc_nulls_first(), and desc_nulls_last(), as sketched below.
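A short sketch of the null-handling variants (df and its score column are assumed names):

import org.apache.spark.sql.functions.{asc_nulls_first, desc_nulls_last}

df.sort(desc_nulls_last("score")).show()      // descending, nulls placed last
df.orderBy(asc_nulls_first("score")).show()   // ascending, nulls placed first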

How do you sort descending in Pyspark?

The desc method orders elements in descending order. Sorting is ascending by default, so desc is needed to sort a PySpark DataFrame in descending order. The orderBy clause returns the rows in sorted order.

How do I sort rows in Pyspark?

You may need to create an index on your DataFrame first. If you want to sort all data row-wise, I would suggest transposing the data, sorting it, and transposing it back. You can refer to guides on how to transpose a DataFrame in PySpark.


3 Answers

There are two versions of orderBy, one that works with strings and one that works with Column objects (API). Your code is using the first version, which does not allow for changing the sort order. You need to switch to the column version and then call the desc method, e.g., myCol.desc.
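For instance, a minimal contrast (df and its price column are assumed names):

import org.apache.spark.sql.functions.col

df.orderBy("price")             // string version: always ascending
df.orderBy(col("price").desc)   // Column version: sort order is configurable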

Now, we get into API design territory. The advantage of passing Column parameters is that you have a lot more flexibility, e.g., you can use expressions, etc. If you want to maintain an API that takes in a string as opposed to a Column, you need to convert the string to a column. There are a number of ways to do this and the easiest is to use org.apache.spark.sql.functions.col(myColName).

Putting it all together, we get

.orderBy(org.apache.spark.sql.functions.col(top_value).desc)
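Putting that into the function from the question gives a sketch like this (same inputs as above; only the orderBy call changes):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
  val top_keys: List[String] = top_key.split(",").map(_.trim).toList
  val w = Window.partitionBy(top_keys.head, top_keys.tail: _*)
    .orderBy(col(top_value).desc)  // col() turns the string into a Column, so .desc is available
  df.withColumn("rn", row_number().over(w))
    .where("rn < " + top_x)
    .drop("rn")
}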
answered Oct 14 '22 by Sim


For example, to order by a column called Date in descending order in the window function, put the $ symbol before the column name; this turns the string into a Column and enables the asc/desc syntax.

Window.orderBy($"Date".desc)

After specifying the column name in double quotes, append .desc to sort in descending order.
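A self-contained sketch (assumes a SparkSession named spark; the implicits import is what enables the $ syntax; id and Date are assumed column names):

import org.apache.spark.sql.expressions.Window
import spark.implicits._   // brings the $"..." column syntax into scope

val w = Window.partitionBy($"id").orderBy($"Date".desc)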

answered Oct 14 '22 by Sarath KS


Using the Column class directly (Java):

import org.apache.spark.sql.Column;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

Column col = new Column("ts").desc();
WindowSpec w = Window.partitionBy("col1", "col2").orderBy(col);
answered Oct 14 '22 by GPopat