
Spark DataFrame groupBy and sort in the descending order (pyspark)

I'm using pyspark (Python 2.7.9 / Spark 1.3.1) and have a grouped DataFrame (GroupObject) which I need to filter and sort in descending order. I'm trying to achieve it with this piece of code:

group_by_dataframe.count().filter("`count` >= 10").sort('count', ascending=False) 

But it throws the following error.

sort() got an unexpected keyword argument 'ascending' 
asked Dec 29 '15 by rclakmal

People also ask

How do you sort PySpark DataFrame in descending order?

You can use either the sort() or orderBy() function of a PySpark DataFrame to sort it in ascending or descending order based on a single column or multiple columns; you can also sort using PySpark SQL sorting functions.

How does PySpark sort grouped data?

You can sort the grouped result using the sort() function, accessing the column with the col() function and applying desc() to sort it in descending order.

How do you sort a DataFrame based on a column in PySpark?

We can use either orderBy() or sort() method to sort the data in the dataframe. Pass asc() to sort the data in ascending order; otherwise, desc(). We can do this based on a single column or multiple columns.

What is the difference between orderBy and sort by in Spark?

On a DataFrame, sort() and orderBy() are aliases, and both produce a globally sorted result. The partition-local behaviour often attributed to sort() actually belongs to sortWithinPartitions() (SORT BY in Spark SQL): it sorts each partition individually, which is cheaper because it avoids a full shuffle, but the overall order of the output is not guaranteed. ORDER BY / orderBy() performs a global sort across all partitions.


1 Answer

In Spark 1.3 the sort method doesn't take an ascending parameter. You can use the desc method of Column instead:

from pyspark.sql.functions import col

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(col("count").desc()))

or the desc function:

from pyspark.sql.functions import desc

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(desc("count")))

Both methods can be used with Spark >= 1.3 (including Spark 2.x).

answered Oct 12 '22 by zero323