Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I sort a dataframe by an array not in the dataframe

I've answered this question several times in the guise of different contexts and I realized that there isn't a good canonical approach specified anywhere.

So, to set up a simple problem:

Problem

df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))
print(df)

   A  B
0  0  1
1  1  2
2  2  1
3  3  2
4  4  1
5  5  2

Question:

How do I sort by the product of columns 'A' and 'B'?
Here is an approach where I add a temporary column to the dataframe, use it to sort_values then drop it.

df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)

   A  B
0  0  1
1  1  2
2  2  1
4  4  1
3  3  2
5  5  2

Is there a better, more concise, clearer, more consistent approach?

like image 313
piRSquared Avatar asked Feb 05 '23 18:02

piRSquared


1 Answers

TL;DR
iloc + argsort


We can approach this using iloc where we can take an array of ordinal positions and return the dataframe reordered by these positions.

With the power of iloc, we can sort with any array that specifies the order.

Now, all we need to do is identify a method for getting this ordering. Turns out there is a method called argsort which does exactly this. By passing the results of argsort to iloc, we can get our dataframe sorted out.

Example 1

Using the specified problem above

df.iloc[df.prod(1).argsort()]

Same results as above

   A  B
0  0  1
1  1  2
2  2  1
4  4  1
3  3  2
5  5  2

That was for simplicity. We could take this further if performance is an issue and focus on numpy

v = df.values
a = v.prod(1).argsort()
pd.DataFrame(v[a], df.index[a], df.columns)

How fast are these solutions?

enter image description here

We can see that pd_ext_sort is the most concise but does not scale as well as the others.
np_ext_sort gives the best performance at the expense of transparency. Though, I'd argue that it's still very clear what is going on.

backtest setup

def add_drop():
    return df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)

def pd_ext_sort():
    return df.iloc[df.prod(1).argsort()]

def np_ext_sort():
    v = df.values
    a = v.prod(1).argsort()
    return pd.DataFrame(v[a], df.index[a], df.columns)

results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.rand(i, 2), columns=['A', 'B'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))

results.plot()

Example 2

Suppose I have a column of negative and positive values. I want to sort by increasing magnitude... however, I want the negatives to come first.

Suppose I have dataframe df

df = pd.DataFrame(dict(A=range(-2, 3)))
print(df)

   A
0 -2
1 -1
2  0
3  1
4  2

I'll set up 3 versions again. This time I'll use np.lexsort which returns the same type of array as argsort. Meaning, I can use it to reorder the dataframe.

Caveat: np.lexsort sorts by the last array in its list first. \shurg

def add_drop():
    return df.assign(P=df.A >= 0, M=df.A.abs()).sort_values(['P', 'M']).drop(['P', 'M'], 1)

def pd_ext_sort():
    v = df.A.values
    return df.iloc[np.lexsort([np.abs(v), v >= 0])]

def np_ext_sort():
    v = df.A.values
    a = np.lexsort([np.abs(v), v >= 0])
    return pd.DataFrame(v[a, None], df.index[a], df.columns)

All of which return

   A
1 -1
0 -2
2  0
3  1
4  2

How fast this time?

enter image description here

In this example, both pd_ext_sort and np_ext_sort outperformed add_drop.

backtest setup

results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.randn(i, 1), columns=['A'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))

results.plot(figsize=(15, 6))
like image 109
piRSquared Avatar answered Feb 07 '23 06:02

piRSquared