Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How is pandas deciding order in a sort when there is a tie?

Tags:

python

pandas

Pandas 0.12.0

In the DataFrame below, why for example does it jumble the indexes? Look at the 4's, the indexes go from 1,15,6,7. What is the reasoning pandas is using to decide how to order, I would have suspected the indexes to remain sequential for an equal value.

mydf=pd.DataFrame(np.random.randint(1, 6, 20),columns=["stars"])
mydf.sort(['stars'], ascending=False)


     stars
19   5
14   5
1    4
15   4
6    4
7    4
4    3
12   3
18   3
8    2
2    2
9    2
10   2
11   2
13   2
16   2
5    1
3    1
17   1
0    1
like image 940
Brian Feeny Avatar asked Oct 25 '13 04:10

Brian Feeny


2 Answers

Actually, if you look into the source code of pandas DataFrame, you'll see that sort() is just a wrapper of sort_index() with different parameter, and, as @Jeff said in this question, sort_index() is prefered method to use.

The sort_index() method using numpy.argsort() with default kind=quicksort, if you're sorting only by one column. And quicksort() is not stable, that's why your index looks shuffled.

But you can pass kind parameter to sort_index() (one of 'mergesort', 'quicksort', 'heapsort'), so you can use stable sort ('mergesort') for your task:

>>> mydf.sort_index(by=['stars'], ascending=False, kind='mergesort')
    stars
17      5
11      5
6       5
1       5
19      4
18      4
15      4
14      4
7       4
5       4
2       4
10      3
8       3
4       3
16      2
12      2
9       2
3       2
13      1
0       1

sort_index() also using mergesort (or counting sort) if there're more that one column in by parameter, it's interesting, for example, you can do this:

>>> mydf.sort_index(by=['stars', 'stars'], ascending=False)
    stars
1       5
6       5
11      5
17      5
2       4
5       4
7       4
14      4
15      4
18      4
19      4
4       3
8       3
10      3
3       2
9       2
12      2
16      2
0       1
13      1

Now the sort is stable, but indexes are sorted ascending

like image 63
Roman Pekar Avatar answered Sep 20 '22 19:09

Roman Pekar


Pandas is using numpy's quicksort. Quicksort involves swapping positions of the items. It stops once they are in the requested order (which in this case, does not involve checking the indices because you didn't ask for that column to be checked). Quicksort is much more efficient than a naive sort algorithm such as bubble sort, which might be what you have in mind-- it would leave the individual numbers closer to their original order, but require more steps to do so.

like image 43
foobarbecue Avatar answered Sep 19 '22 19:09

foobarbecue