How to sort by value efficiently in PySpark?

I want to sort my (K, V) tuples by V, i.e. by the value. I know that takeOrdered is good for this if you know how many you need:

b = sc.parallelize([('t',3),('b',4),('c',1)])

Using takeOrdered:

b.takeOrdered(3,lambda atuple: atuple[1])

Using Lambda:

b.map(lambda aTuple: (aTuple[1], aTuple[0])).sortByKey().map(
    lambda aTuple: (aTuple[1], aTuple[0])).collect()

I've checked out the question here, which suggests the latter. I find it hard to believe that takeOrdered is so succinct and yet it requires the same number of operations as the Lambda solution.

Does anyone know of a simpler, more concise transformation in Spark to sort by value?

226

asked Nov 14 '15 08:11

2 Answers

Just wanted to add this tip, which helped me out a lot:

Ascending:

bSorted = b.sortBy(lambda a: a[1])

Descending:

bSorted = b.sortBy(lambda a: -a[1])
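
Negating the key only works when the value is numeric. As far as I know, sortBy also takes an `ascending` flag, so a small helper along these lines should cover strings and other non-numeric values too. The name `sort_by_value` is made up for illustration; it is not part of PySpark.

```python
def sort_by_value(rdd, descending=False):
    """Sort an RDD of (key, value) tuples by value.

    `sort_by_value` is a hypothetical helper name; it only wraps RDD.sortBy.
    """
    # ascending=False reverses the order without negating the value,
    # so it does not rely on the value being a number.
    return rdd.sortBy(lambda kv: kv[1], ascending=not descending)
```

With the `b` from the question, `sort_by_value(b, descending=True).collect()` should give the same result as the negated-key version above.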

186

answered Oct 11 '22 09:10

REZ

I think sortBy() is more concise:

b = sc.parallelize([('t', 3),('b', 4),('c', 1)])
bSorted = b.sortBy(lambda a: a[1])
bSorted.collect()
...
[('c', 1),('t', 3),('b', 4)]

It's actually not any more efficient, since it involves keying by the values, sorting by the keys, and then grabbing the values, but it looks prettier than your latter solution. In terms of efficiency, I don't think you'll find a better solution, as you would need a way to transform your data such that the values become your keys (and then eventually transform that data back to the original schema).
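
The three steps described above can be written out explicitly. As far as I can tell this is roughly what sortBy does under the hood, though I haven't checked every Spark version, so treat it as a sketch. The function name `sort_by_value_explicit` is made up for illustration.

```python
def sort_by_value_explicit(rdd):
    """Sort an RDD of (key, value) tuples by value, one step at a time.

    `sort_by_value_explicit` is a hypothetical name used only for this sketch.
    """
    return (
        rdd.keyBy(lambda kv: kv[1])  # (value, (key, value)): key each record by its value
           .sortByKey()              # sort on that temporary key
           .values()                 # drop the temporary key, keeping the original tuples
    )
```

Calling `sort_by_value_explicit(b).collect()` on the `b` above should return the same list as `b.sortBy(lambda a: a[1]).collect()`.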

answered Oct 11 '22 08:10

Rohan Aletty


Tags:

python

sorting

lambda

apache-spark

makansij
