 

Spark SQL Row_number() PartitionBy Sort Desc

I've successfully created a row_number() partitionBy window in Spark, but would like to sort it descending instead of the default ascending. Here is my working code:

from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql import Row, functions as F
from pyspark.sql.window import Window

data_cooccur.select("driver", "also_item", "unit_count", 
    F.rowNumber().over(Window.partitionBy("driver").orderBy("unit_count")).alias("rowNum")).show()

That gives me this result:

 +------+---------+----------+------+
 |driver|also_item|unit_count|rowNum|
 +------+---------+----------+------+
 |   s10|      s11|         1|     1|
 |   s10|      s13|         1|     2|
 |   s10|      s17|         1|     3|

And here I add the desc() to order descending:

data_cooccur.select(
    "driver", "also_item", "unit_count",
    F.rowNumber().over(Window.partitionBy("driver").orderBy("unit_count").desc()).alias("rowNum")
).show()

And get this error:

AttributeError: 'WindowSpec' object has no attribute 'desc'

What am I doing wrong here?

asked Feb 06 '16 by jKraut


People also ask

How do I sort descending in Spark?

To sort a Spark DataFrame in descending order, use the desc method of the Column class or the standalone desc() SQL function.
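
For example, a minimal sketch of both spellings (the DataFrame and its column names here are illustrative, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical example data
df = spark.createDataFrame(
    [("s10", 3), ("s11", 1), ("s13", 2)],
    ["item", "unit_count"],
)

# Column method:
df.sort(col("unit_count").desc()).show()

# Equivalent standalone function:
df.sort(desc("unit_count")).show()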

How do you sort descending in PySpark?

The desc() method orders elements in descending order. Sorting is ascending by default, so desc() is how you sort a PySpark DataFrame in descending order. The orderBy clause returns the rows in sorted order.
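
A short sketch of two equivalent ways to do this, again with made-up example data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("a", 3), ("b", 1)], ["item", "unit_count"])

# Column attribute with .desc():
df.orderBy(df.unit_count.desc()).show()

# Or pass the column name and flip the default ascending order:
df.orderBy("unit_count", ascending=False).show()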

How do I assign row numbers in Spark?

row_number() is a window function in Spark SQL that assigns a sequential integer to each row in the result DataFrame. It is used with Window.partitionBy(), which partitions the data into window frames, and an orderBy() clause that sorts the rows within each partition.
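
A self-contained sketch of row_number() over a partitioned, descending window (the sample data is invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, desc
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("s10", "s11", 1), ("s10", "s13", 4), ("s11", "s17", 2)],
    ["driver", "also_item", "unit_count"],
)

# One window frame per driver, rows numbered by descending unit_count
w = Window.partitionBy("driver").orderBy(desc("unit_count"))
df.withColumn("rowNum", row_number().over(w)).show()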

How do you select top 5 rows in PySpark?

In Spark/PySpark, you can use the show() action to get the top/first N (5, 10, 100, ...) rows of a DataFrame and display them on the console or in a log. There are also several Spark actions, such as take(), tail(), collect(), head(), and first(), that return the top or last n rows as a list of Rows (Array[Row] in Scala).
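
A quick sketch of these actions (df here is just an example DataFrame; note that tail() requires Spark 3.0+):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(100)  # simple example DataFrame with a single "id" column

df.show(5)              # print the first 5 rows to the console
top5 = df.take(5)       # first 5 rows as a list of Row objects
same5 = df.head(5)      # equivalent to take(5) when given n
first_row = df.first()  # just the first Row
last5 = df.tail(5)      # last 5 rows (available in Spark 3.0+)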


2 Answers

desc should be applied to a column, not to a window definition. You can use either a method on a column:

from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

row_number().over(
    Window.partitionBy("driver").orderBy(col("unit_count").desc())
)

or a standalone function:

from pyspark.sql.functions import desc, row_number
from pyspark.sql.window import Window

row_number().over(
    Window.partitionBy("driver").orderBy(desc("unit_count"))
)
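
Applied to the original query from the question (assuming the same data_cooccur DataFrame, and using the current row_number name rather than the old rowNumber), the corrected select would look roughly like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Same select as in the question, with desc() moved onto the column
data_cooccur.select(
    "driver", "also_item", "unit_count",
    F.row_number().over(
        Window.partitionBy("driver").orderBy(F.col("unit_count").desc())
    ).alias("rowNum"),
).show()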
answered by zero323


Or you can use SQL directly in Spark SQL:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName('Test') \
    .getOrCreate()

# The DataFrame must be registered as a temp view before it can be
# referenced by name in a SQL query.
data_cooccur.createOrReplaceTempView('data_cooccur')

spark.sql("""
    select driver
        ,also_item
        ,unit_count
        ,ROW_NUMBER() OVER (PARTITION BY driver ORDER BY unit_count DESC) AS rowNum
    from data_cooccur
""").show()
answered by kennyut