How do I sort a dataframe by an array not in the dataframe

Question

I've answered this question several times in the guise of different contexts and I realized that there isn't a good canonical approach specified anywhere.

So, to set up a simple problem:

Problem

df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))
print(df)

   A  B
0  0  1
1  1  2
2  2  1
3  3  2
4  4  1
5  5  2

Question:

How do I sort by the product of columns 'A' and 'B'?
Here is an approach where I add a temporary column to the dataframe, use it to sort_values then drop it.

df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)

   A  B
0  0  1
1  1  2
2  2  1
4  4  1
3  3  2
5  5  2

Is there a better, more concise, clearer, more consistent approach?

piRSquared · Accepted Answer

TL;DR
iloc + argsort

We can approach this using iloc where we can take an array of ordinal positions and return the dataframe reordered by these positions.

With the power of iloc, we can sort with any array that specifies the order.

Now, all we need to do is identify a method for getting this ordering. Turns out there is a method called argsort which does exactly this. By passing the results of argsort to iloc, we can get our dataframe sorted out.

Example 1

Using the specified problem above

df.iloc[df.prod(1).argsort()]

Same results as above

That was for simplicity. We could take this further if performance is an issue and focus on numpy

v = df.values
a = v.prod(1).argsort()
pd.DataFrame(v[a], df.index[a], df.columns)

How fast are these solutions?

enter image description here

We can see that pd_ext_sort is the most concise but does not scale as well as the others.
np_ext_sort gives the best performance at the expense of transparency. Though, I'd argue that it's still very clear what is going on.

backtest setup

def add_drop():
    return df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)

def pd_ext_sort():
    return df.iloc[df.prod(1).argsort()]

def np_ext_sort():
    v = df.values
    a = v.prod(1).argsort()
    return pd.DataFrame(v[a], df.index[a], df.columns)

results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.rand(i, 2), columns=['A', 'B'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))

results.plot()

Example 2

Suppose I have a column of negative and positive values. I want to sort by increasing magnitude... however, I want the negatives to come first.

Suppose I have dataframe df

df = pd.DataFrame(dict(A=range(-2, 3)))
print(df)

   A
0 -2
1 -1
2  0
3  1
4  2

I'll set up 3 versions again. This time I'll use np.lexsort which returns the same type of array as argsort. Meaning, I can use it to reorder the dataframe.

Caveat: np.lexsort sorts by the last array in its list first. \shurg

def add_drop():
    return df.assign(P=df.A >= 0, M=df.A.abs()).sort_values(['P', 'M']).drop(['P', 'M'], 1)

def pd_ext_sort():
    v = df.A.values
    return df.iloc[np.lexsort([np.abs(v), v >= 0])]

def np_ext_sort():
    v = df.A.values
    a = np.lexsort([np.abs(v), v >= 0])
    return pd.DataFrame(v[a, None], df.index[a], df.columns)

All of which return

How fast this time?

enter image description here

In this example, both pd_ext_sort and np_ext_sort outperformed add_drop.

backtest setup

results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.randn(i, 1), columns=['A'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))

results.plot(figsize=(15, 6))

How do I sort a dataframe by an array not in the dataframe

Tags:

python

sorting

pandas

numpy

Problem

Question:

piRSquared

1 Answers

Example 1

Example 2

piRSquared

Recent Activity

Donate For Us

How do I sort a dataframe by an array not in the dataframe

Tags:

python

sorting

pandas

numpy

Problem

Question:

piRSquared

1 Answers

Example 1

Example 2

piRSquared

Related questions

Recent Activity

Donate For Us