I've answered this question several times in the guise of different contexts and I realized that there isn't a good canonical approach specified anywhere.
So, to set up a simple problem:
df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))
print(df)
A B
0 0 1
1 1 2
2 2 1
3 3 2
4 4 1
5 5 2
How do I sort by the product of columns 'A' and 'B'?
Here is an approach where I add a temporary column to the dataframe, use it to sort_values then drop it.
df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)
A B
0 0 1
1 1 2
2 2 1
4 4 1
3 3 2
5 5 2
Is there a better, more concise, clearer, more consistent approach?
TL;DRiloc + argsort
We can approach this using iloc where we can take an array of ordinal positions and return the dataframe reordered by these positions.
With the power of iloc, we can sort with any array that specifies the order.
Now, all we need to do is identify a method for getting this ordering. Turns out there is a method called argsort which does exactly this. By passing the results of argsort to iloc, we can get our dataframe sorted out.
Using the specified problem above
df.iloc[df.prod(1).argsort()]
Same results as above
A B
0 0 1
1 1 2
2 2 1
4 4 1
3 3 2
5 5 2
That was for simplicity. We could take this further if performance is an issue and focus on numpy
v = df.values
a = v.prod(1).argsort()
pd.DataFrame(v[a], df.index[a], df.columns)
How fast are these solutions?

We can see that pd_ext_sort is the most concise but does not scale as well as the others.np_ext_sort gives the best performance at the expense of transparency. Though, I'd argue that it's still very clear what is going on.
backtest setup
def add_drop():
return df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)
def pd_ext_sort():
return df.iloc[df.prod(1).argsort()]
def np_ext_sort():
v = df.values
a = v.prod(1).argsort()
return pd.DataFrame(v[a], df.index[a], df.columns)
results = pd.DataFrame(
index=pd.Index([10, 100, 1000, 10000], name='Size'),
columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)
for i in results.index:
df = pd.DataFrame(np.random.rand(i, 2), columns=['A', 'B'])
for j in results.columns:
stmt = '{}()'.format(j)
setup = 'from __main__ import df, {}'.format(j)
results.set_value(i, j, timeit(stmt, setup, number=100))
results.plot()
Suppose I have a column of negative and positive values. I want to sort by increasing magnitude... however, I want the negatives to come first.
Suppose I have dataframe df
df = pd.DataFrame(dict(A=range(-2, 3)))
print(df)
A
0 -2
1 -1
2 0
3 1
4 2
I'll set up 3 versions again. This time I'll use np.lexsort which returns the same type of array as argsort. Meaning, I can use it to reorder the dataframe.
Caveat: np.lexsort sorts by the last array in its list first. \shurg
def add_drop():
return df.assign(P=df.A >= 0, M=df.A.abs()).sort_values(['P', 'M']).drop(['P', 'M'], 1)
def pd_ext_sort():
v = df.A.values
return df.iloc[np.lexsort([np.abs(v), v >= 0])]
def np_ext_sort():
v = df.A.values
a = np.lexsort([np.abs(v), v >= 0])
return pd.DataFrame(v[a, None], df.index[a], df.columns)
All of which return
A
1 -1
0 -2
2 0
3 1
4 2
How fast this time?

In this example, both pd_ext_sort and np_ext_sort outperformed add_drop.
backtest setup
results = pd.DataFrame(
index=pd.Index([10, 100, 1000, 10000], name='Size'),
columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)
for i in results.index:
df = pd.DataFrame(np.random.randn(i, 1), columns=['A'])
for j in results.columns:
stmt = '{}()'.format(j)
setup = 'from __main__ import df, {}'.format(j)
results.set_value(i, j, timeit(stmt, setup, number=100))
results.plot(figsize=(15, 6))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With