
How to sort by value efficiently in PySpark?

I want to sort my (K, V) tuples by V, i.e. by the value. I know that takeOrdered is good for this if you know how many you need:

b = sc.parallelize([('t',3),('b',4),('c',1)])

Using takeOrdered:

b.takeOrdered(3, lambda aTuple: aTuple[1])
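
For the sample RDD above, this returns the three smallest pairs by value; negating the key gives the largest instead (a quick sketch, assuming numeric values):

b.takeOrdered(3, lambda aTuple: aTuple[1])
# [('c', 1), ('t', 3), ('b', 4)]

b.takeOrdered(3, lambda aTuple: -aTuple[1])
# [('b', 4), ('t', 3), ('c', 1)]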

Using map and sortByKey:

b.map(lambda aTuple: (aTuple[1], aTuple[0])).sortByKey().map(
    lambda aTuple: (aTuple[1], aTuple[0])).collect()
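
For the sample RDD, this produces:

[('c', 1), ('t', 3), ('b', 4)]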

I've checked out the question here, which suggests the latter. I find it hard to believe that takeOrdered is so succinct and yet requires the same number of operations as the map/sortByKey solution.

Does anyone know of a simpler, more concise transformation in Spark to sort by value?

asked Nov 14 '15 by makansij

People also ask

How do you sort values in PySpark?

You can use either the sort() or orderBy() function of a PySpark DataFrame to sort it in ascending or descending order, based on a single column or multiple columns. You can also sort using PySpark SQL sorting functions.
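
A minimal sketch of both calls (the column names key and value are illustrative, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('t', 3), ('b', 4), ('c', 1)], ['key', 'value'])

df.sort('value').show()     # ascending by default
df.orderBy('value').show()  # same result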

How do you sort values in descending order in PySpark?

We can use either the orderBy() or the sort() method to sort the data in the DataFrame. Pass asc() to sort a column in ascending order and desc() for descending. This works on a single column or multiple columns.
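
Continuing the sketch above, descending order can be requested either way:

from pyspark.sql.functions import desc

df.sort(desc('value')).show()       # sort-function form
df.orderBy(df.value.desc()).show()  # Column-method form, same result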

How does sort work in PySpark?

In PySpark, the DataFrame class provides a sort() function that sorts on one or more columns, in ascending order by default. DataFrame also provides an orderBy() function, which behaves the same way.
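
To make the default explicit, sorting with asc() gives the same rows as sorting with no direction at all (continuing the sketch):

from pyspark.sql.functions import asc

df.sort('value').collect() == df.sort(asc('value')).collect()  # True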

How do I sort multiple columns in PySpark?

In Spark, we can use the DataFrame's sort() function to sort on multiple columns. To mix ascending and descending, use asc and desc on each Column, as in the sketch below.
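
Continuing the sketch, a mixed-direction sort on two columns:

from pyspark.sql.functions import asc, desc

df.sort(asc('key'), desc('value')).show()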


2 Answers

Just wanted to add this tip, which helped me out a lot.

Ascending:

bSorted = b.sortBy(lambda a: a[1])

Descending:

bSorted = b.sortBy(lambda a: -a[1])
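
Note that the -a[1] trick assumes numeric values; sortBy also accepts an ascending flag, which works for any orderable key:

bSorted = b.sortBy(lambda a: a[1], ascending=False)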
answered Oct 11 '22 by REZ

I think sortBy() is more concise:

b = sc.parallelize([('t', 3),('b', 4),('c', 1)])
bSorted = b.sortBy(lambda a: a[1])
bSorted.collect()
...
[('c', 1),('t', 3),('b', 4)]

It's actually not any more efficient, as it still involves keying by the values, sorting by the keys, and then grabbing the values, but it looks prettier than your latter solution. In terms of efficiency, I don't think you'll find anything better, since you need some way to turn your values into keys (and then eventually transform the data back to the original schema).
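
For what it's worth, PySpark's RDD.sortBy is essentially shorthand for that same key/sort/un-key pipeline; a rough sketch of the idea:

def sortBy(self, keyfunc, ascending=True, numPartitions=None):
    # key each element by the sort key, sort by key, then drop the key again
    return self.keyBy(keyfunc).sortByKey(ascending, numPartitions).values()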

answered Oct 11 '22 by Rohan Aletty