I've successfully create a row_number()
partitionBy
by in Spark using Window, but would like to sort this by descending, instead of the default ascending. Here is my working code:
from pyspark import HiveContext
from pyspark.sql.types import *
from pyspark.sql import Row, functions as F
from pyspark.sql.window import Window
data_cooccur.select("driver", "also_item", "unit_count",
F.rowNumber().over(Window.partitionBy("driver").orderBy("unit_count")).alias("rowNum")).show()
That gives me this result:
+------+---------+----------+------+
|driver|also_item|unit_count|rowNum|
+------+---------+----------+------+
| s10| s11| 1| 1|
| s10| s13| 1| 2|
| s10| s17| 1| 3|
And here I add the desc() to order descending:
data_cooccur.select("driver", "also_item", "unit_count", F.rowNumber().over(Window.partitionBy("driver").orderBy("unit_count").desc()).alias("rowNum")).show()
And get this error:
AttributeError: 'WindowSpec' object has no attribute 'desc'
What am I doing wrong here?
In order to sort by descending order in Spark DataFrame, we can use desc property of the Column class or desc() sql function.
The Desc method is used to order the elements in descending order. By default the sorting technique used is in Ascending order, so by the use of Desc method, we can sort the element in Descending order in a PySpark Data Frame. The orderBy clause is used to return the row in a sorted Manner.
The row_number() is a window function in Spark SQL that assigns a row number (sequential integer number) to each row in the result DataFrame. This function is used with Window. partitionBy() which partitions the data into windows frames and orderBy() clause to sort the rows in each partition.
In Spark/PySpark, you can use show() action to get the top/first N (5,10,100 ..) rows of the DataFrame and display them on a console or a log, there are also several Spark Actions like take() , tail() , collect() , head() , first() that return top and last n rows as a list of Rows (Array[Row] for Scala).
desc
should be applied on a column not a window definition. You can use either a method on a column:
from pyspark.sql.functions import col, row_number from pyspark.sql.window import Window F.row_number().over( Window.partitionBy("driver").orderBy(col("unit_count").desc()) )
or a standalone function:
from pyspark.sql.functions import desc from pyspark.sql.window import Window F.row_number().over( Window.partitionBy("driver").orderBy(desc("unit_count")) )
Or you can use the SQL code in Spark-SQL:
from pyspark.sql import SparkSession spark = SparkSession\ .builder\ .master('local[*]')\ .appName('Test')\ .getOrCreate() spark.sql(""" select driver ,also_item ,unit_count ,ROW_NUMBER() OVER (PARTITION BY driver ORDER BY unit_count DESC) AS rowNum from data_cooccur """).show()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With