Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas sorting by value and then by index

Tags:

python

pandas

I have the following dataset:

import numpy as np
from pandas import DataFrame
import numpy.random as random

random.seed(12)

df = DataFrame (
    {
        "fac1" : ["a","a","a","a","b","b","b","b"] ,
        "val" : random.choice(np.arange(0,20), 8, replace=False)
    }
)
df2 = df.set_index(["fac1"])
df2

enter image description here

What I want is to sort by val within each fac1 group, to produce this:

enter image description here

I have combed the documentation and cannot find a straightforward way. The best I could do was the following hack:

df3 = df2.reset_index()
df4 = df3.sort_values(["fac1","val"],ascending=[True,True],axis=0)
df5 = df4.set_index(["fac1"])
df5
# Produces the picture above

(I realize the above could benefit from multiple inplace options, just doing it this way to make intermediate products clear).

I did find this SO post, which uses grouping and a sorting function. However the following code, adapted from that post, produced an incorrect result:

df2.groupby("fac1",axis=1).apply(lambda x : x.sort_values("val"))

(Output removed for space considerations)

Is there another way to approach this?

Update: Solution

The accepted solution is:

df2.sort_values(by='val').sort_index(kind='mergesort')

The sorting algorithm must be mergesort and it must be explicitly specified as it is not the default. As the sort_index documentation points out, "mergesort is the only stable algorithm." Here's another sample dataset that will not sort properly if you don't specify mergesort for kind:

random.seed(12)

len = 32 

df = DataFrame (
    {
        "fac1" : ["a" for i in range(int(len/2))] + ["b" for i in range(int(len/2))] ,
        "val" : random.choice(np.arange(0,100), len, replace=False)
    }
)
df2 = df.set_index(["fac1"])
df2.sort_values(by='val').sort_index()

(Am omitting all outputs for space consideration)

like image 293
sparc_spread Avatar asked Nov 13 '15 18:11

sparc_spread


1 Answers

EDIT: I looked into the documentation and the default sorting algorithm for sort_index is quicksort. This is NOT a "stable" algorithm, in that it does not preserve "the input order of equal elements in the sorted output" (from Wikipedia). However, sort_index gives you the option to choose "mergesort", which IS a stable sorting algorithm. So the fact that my original answer,

df2.sort_values(by='val').sort_index()

, worked, was simply happenstance. This code should work every time, since it uses a stable sorting algorithm:

df2.sort_values(by='val').sort_index(kind = 'mergesort')
like image 199
Sam Avatar answered Nov 16 '22 05:11

Sam