For each row, I would like to get the column labels of the 3 largest values, ordered from largest to smallest. Let's assume this input:
   p1  p2  p3  p4
0   0   9   1   4
1   0   2   3   4
2   1   3  10   7
3   1   5   3   1
4   2   3   7  10
Given n = 3, I am looking for:
  Top1 Top2 Top3
0   p2   p4   p3
1   p4   p3   p2
2   p3   p4   p2
3   p2   p3   p1
4   p4   p3   p2
I'm not concerned about duplicates, e.g. for index 3, Top3 can be 'p1' or 'p4'.
My first attempt is a full sort using np.ndarray.argsort:
res = pd.DataFrame(df.columns[df.values.argsort(1)]).iloc[:, len(df.index): 0: -1]
But in reality I have more than 4 columns and this will be inefficient.
Next I tried np.argpartition. But since values within each partition are not sorted, this required a subsequent sort:
n = 3
parts = np.argpartition(-df.values, n, axis=1)[:, :-1]
args = (-df.values[np.arange(df.shape[0])[:, None], parts]).argsort(1)
res = pd.DataFrame(df.columns[parts[np.arange(df.shape[0])[:, None], args]],
                   columns=[f'Top{i}' for i in range(1, n+1)])
This, in fact, works out slower than the first attempt for larger dataframes. Is there a more efficient way that takes advantage of partial sorting? You can use the code below for benchmarking.
# Python 3.6.0, NumPy 1.11.3, Pandas 0.19.2
import pandas as pd, numpy as np
df = pd.DataFrame({'p1': [0, 0, 1, 1, 2],
                   'p2': [9, 2, 3, 5, 3],
                   'p3': [1, 3, 10, 3, 7],
                   'p4': [4, 4, 7, 1, 10]})
def full_sort(df):
    return pd.DataFrame(df.columns[df.values.argsort(1)]).iloc[:, len(df.index): 0: -1]
def partial_sort(df):
    n = 3
    parts = np.argpartition(-df.values, n, axis=1)[:, :-1]
    args = (-df.values[np.arange(df.shape[0])[:, None], parts]).argsort(1)
    return pd.DataFrame(df.columns[parts[np.arange(df.shape[0])[:, None], args]])
df = pd.concat([df]*10**5)
%timeit full_sort(df)     # 86.3 ms per loop
%timeit partial_sort(df)  # 158 ms per loop
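One detail worth noting about the two attempts: their slices (`len(df.index): 0: -1` and `[:, :-1]`) return exactly three columns only because the sample frame has four columns. A minimal sketch of the same two ideas with the slice tied to `n` might look like the following. The names `full_sort_n` and `partial_sort_n` are hypothetical and not part of the original post, and ties between equal values may be resolved differently by the two functions.

```python
import numpy as np
import pandas as pd

def full_sort_n(df, n):
    # Hypothetical variant of full_sort: argsort every row in ascending
    # order, then take the last n positions in reverse (largest first).
    order = df.values.argsort(1)[:, :-n - 1:-1]
    return pd.DataFrame(df.columns.values[order],
                        columns=[f'Top{i}' for i in range(1, n + 1)])

def partial_sort_n(df, n):
    # Hypothetical variant of partial_sort with the slice tied to n.
    a = df.values
    rows = np.arange(a.shape[0])[:, None]   # column vector of row numbers
    # kth = n - 1 on the negated data places the n largest values of each
    # row in the first n slots, in no particular order.
    parts = np.argpartition(-a, n - 1, axis=1)[:, :n]
    # Sort only those n candidates, largest first.
    args = (-a[rows, parts]).argsort(1)
    return pd.DataFrame(df.columns.values[parts[rows, args]],
                        columns=[f'Top{i}' for i in range(1, n + 1)])
```

With these versions, a frame with 30 columns still yields a 3-column result for `n = 3`.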
With a decent number of columns, we can use np.argpartition with some slicing and indexing, like so -
def topN_perrow_colsindexed(df, N):
    # Extract array data
    a = df.values
    # Get top N indices per row with not necessarily sorted order
    idxtopNpart = np.argpartition(a, -N, axis=1)[:, -1:-N-1:-1]
    # Index into input data with those and use argsort to force sorted order
    sidx = np.take_along_axis(a, idxtopNpart, axis=1).argsort(1)
    idxtopN = np.take_along_axis(idxtopNpart, sidx[:, ::-1], axis=1)
    # Index into column values with those for final output
    c = df.columns.values
    # A flat list of labels gives plain columns; a nested list would build a MultiIndex
    return pd.DataFrame(c[idxtopN], columns=['Top'+str(i+1) for i in range(N)])
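Since `np.take_along_axis` was only added in NumPy 1.15, the function above will not run on the NumPy 1.11.3 mentioned in the question. A minimal sketch of the same approach using plain integer-array indexing instead could look like this. The name `topN_perrow_fancyindex` is hypothetical, and this is one possible substitution rather than the answer's own code.

```python
import numpy as np
import pandas as pd

def topN_perrow_fancyindex(df, N):
    # Hypothetical variant of topN_perrow_colsindexed for NumPy versions
    # that lack np.take_along_axis.
    a = df.values
    rows = np.arange(a.shape[0])[:, None]   # column vector of row numbers
    # Column positions of the N largest values per row, unordered.
    part = np.argpartition(a, -N, axis=1)[:, -N:]
    # Order those N candidates by value, largest first.
    order = a[rows, part].argsort(1)[:, ::-1]
    top = part[rows, order]
    labels = ['Top' + str(i + 1) for i in range(N)]
    return pd.DataFrame(df.columns.values[top], columns=labels)
```

The `rows` column vector broadcasts against the `(rows, N)` index arrays, so each row is indexed with its own set of column positions.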
Sample run -
In [65]: df
Out[65]: 
   p1  p2  p3  p4
0   0   9   1   4
1   0   2   3   4
2   1   3  10   7
3   1   5   3   1
4   2   3   7  10
In [66]: topN_perrow_colsindexed(df, N=3)
Out[66]: 
  Top1 Top2 Top3
0   p2   p4   p3
1   p4   p3   p2
2   p3   p4   p2
3   p2   p3   p4
4   p4   p3   p2
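To see how a single row arrives at its labels, the steps can be traced on row 0 of the sample frame. This is only an illustrative sketch: the variable names are made up for the trace, and it assumes NumPy 1.15 or later for `np.take_along_axis`.

```python
import numpy as np

row = np.array([[0, 9, 1, 4]])   # row 0 of the sample frame, kept 2-D
N = 3

# Step 1: column positions of the N largest values. Their order within
# the group is not specified by argpartition.
part = np.argpartition(row, -N, axis=1)[:, -N:]

# Step 2: look up the values at those positions and sort just those N,
# then reverse so the largest comes first.
vals = np.take_along_axis(row, part, axis=1)
order = vals.argsort(1)[:, ::-1]

# Step 3: map the sorted order back to column positions.
top = np.take_along_axis(part, order, axis=1)
print(top)   # [[1 3 2]] -> columns p2, p4, p3
```

Only the final ordering is deterministic here; the intermediate `part` always contains positions 1, 2 and 3, but possibly in a different order.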
Timings -
In [143]: np.random.seed(0)
In [144]: df = pd.DataFrame(np.random.rand(10000,30))
In [145]: %timeit full_sort(df)
     ...: %timeit partial_sort(df)
     ...: %timeit topN_perrow_colsindexed(df,N=3)
100 loops, best of 3: 7.96 ms per loop
100 loops, best of 3: 13.9 ms per loop
100 loops, best of 3: 5.47 ms per loop
In [146]: df = pd.DataFrame(np.random.rand(10000,100))
In [147]: %timeit full_sort(df)
     ...: %timeit partial_sort(df)
     ...: %timeit topN_perrow_colsindexed(df,N=3)
10 loops, best of 3: 34 ms per loop
10 loops, best of 3: 56.1 ms per loop
100 loops, best of 3: 13.6 ms per loop
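The timings above rely on IPython's `%timeit`. To rerun a similar comparison from a plain Python script, a rough sketch using the standard `timeit` module might look like this. The helper names `full_argsort_top`, `partition_top` and `best_ms` are hypothetical; the sketch works on the raw arrays rather than on dataframes and assumes NumPy 1.15 or later. Results will vary by machine and library version.

```python
import timeit
import numpy as np

def full_argsort_top(a, n):
    # Full sort of every row, then keep the n largest (largest first).
    return a.argsort(1)[:, :-n - 1:-1]

def partition_top(a, n):
    # Partial sort: partition first, then sort only the n candidates.
    part = np.argpartition(a, -n, axis=1)[:, -n:]
    order = np.take_along_axis(a, part, axis=1).argsort(1)[:, ::-1]
    return np.take_along_axis(part, order, axis=1)

def best_ms(func, *args, repeat=3, number=10):
    # Best-of-`repeat` average time per call, in milliseconds.
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e3

rng = np.random.RandomState(0)
for ncols in (30, 100):
    a = rng.rand(10000, ncols)
    print(ncols, 'columns:',
          'full sort %.2f ms,' % best_ms(full_argsort_top, a, 3),
          'partition %.2f ms' % best_ms(partition_top, a, 3))
```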