How to find column-index of top-n values within each row of huge dataframe

I have a dataframe in the following format (example data):

      Metric1  Metric2  Metric3  Metric4  Metric5
ID    
1     0.5      0.3      0.2      0.8      0.7    
2     0.1      0.8      0.5      0.2      0.4    
3     0.3      0.1      0.7      0.4      0.2    
4     0.9      0.4      0.8      0.5      0.2

where the scores range between 0 and 1. I want to write a function that, for each ID (row), finds the top n metrics, where n is passed to the function along with the original dataframe.

My ideal output would be (e.g. for n = 3):

      Top_1     Top_2     Top_3
ID    
1     Metric4   Metric5   Metric1    
2     Metric2   Metric3   Metric5    
3     Metric3   Metric4   Metric1    
4     Metric1   Metric3   Metric4

Now I have written a function that does work:

import numpy as np
import pandas as pd

def top_n_partners(scores, top_n=3):
    metrics = np.array(scores.columns)
    records = []
    # Walk the rows one at a time (this Python-level loop is the slow part).
    for rec in scores.to_records():
        rec = list(rec)
        ID = rec[0]
        score_vals = rec[1:]
        # Ascending argsort, then reverse so the best metric comes first.
        inds = np.argsort(score_vals)
        top_metrics = metrics[inds][::-1]
        dic = {
            'top_score_%s' % (i+1): top_metrics[i]
            for i in range(top_n)
        }
        dic['ID'] = ID
        records.append(dic)
    top_n_df = pd.DataFrame(records)
    top_n_df.set_index('ID', inplace=True)
    return top_n_df

However, it seems rather inefficient and slow, especially for the volume of data I'd be running it over (a dataframe with millions of rows). Is there a smarter way to go about this?

261

asked Jun 01 '17 12:06

tfcoe

2 Answers

You can use numpy.argsort:

print (np.argsort(-df.values, axis=1)[:,:3])
[[3 4 0]
 [1 2 4]
 [2 3 0]
 [0 2 3]]

print (df.columns[np.argsort(-df.values, axis=1)[:,:3]])

Index([['Metric4', 'Metric5', 'Metric1'], ['Metric2', 'Metric3', 'Metric5'],
       ['Metric3', 'Metric4', 'Metric1'], ['Metric1', 'Metric3', 'Metric4']],
      dtype='object')

df = pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)[:, :3]],
                  index=df.index)
df = df.rename(columns=lambda x: 'Top_{}'.format(x + 1))
print(df)
      Top_1    Top_2    Top_3
ID                           
1   Metric4  Metric5  Metric1
2   Metric2  Metric3  Metric5
3   Metric3  Metric4  Metric1
4   Metric1  Metric3  Metric4
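
Since the question asks for a function that takes n as an input, the argsort approach above can be wrapped up as in the sketch below. The names `top_n_columns`, `example` and `result` are made up for illustration. It indexes a plain NumPy array of column labels instead of `df.columns` directly, because, as far as I know, newer pandas releases (2.0 and later) refuse to index an Index with a 2-D array.

```python
import numpy as np
import pandas as pd


def top_n_columns(scores: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Return, per row, the labels of the n largest values (largest first).

    Hypothetical helper name, not part of pandas or NumPy.
    """
    # Plain NumPy array of column labels, so 2-D fancy indexing is allowed.
    labels = scores.columns.to_numpy()
    # Negate the values so that an ascending argsort gives descending order,
    # then keep the first n positions of every row.
    order = np.argsort(-scores.to_numpy(), axis=1)[:, :n]
    return pd.DataFrame(
        labels[order],
        index=scores.index,
        columns=[f"Top_{i + 1}" for i in range(n)],
    )


# Example data copied from the question.
example = pd.DataFrame(
    {
        "Metric1": [0.5, 0.1, 0.3, 0.9],
        "Metric2": [0.3, 0.8, 0.1, 0.4],
        "Metric3": [0.2, 0.5, 0.7, 0.8],
        "Metric4": [0.8, 0.2, 0.4, 0.5],
        "Metric5": [0.7, 0.4, 0.2, 0.2],
    },
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

result = top_n_columns(example, n=3)
print(result)
```

Unlike the snippet above, this leaves the original dataframe untouched and returns a new one.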

Thank you Divakar for improving:

n = 3
# Last n positions of the ascending argsort, reversed so the largest comes first.
df = pd.DataFrame(df.columns[df.values.argsort(1)[:, :-n-1:-1]],
                  index=df.index)

df = df.rename(columns = lambda x: 'Top_{}'.format(x + 1))
print (df)
      Top_1    Top_2    Top_3
ID                           
1   Metric4  Metric5  Metric1
2   Metric2  Metric3  Metric5
3   Metric3  Metric4  Metric1
4   Metric1  Metric3  Metric4
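
For a very wide dataframe with a small n, a full sort of every row may be more work than needed. This is a different technique from the answer above, not a rewrite of it: `np.argpartition` picks out the n largest entries per row without ordering the rest, and only those n candidates are then sorted. A minimal sketch, assuming n is between 1 and the number of columns; `top_n_columns_partitioned`, `example` and `result_fast` are hypothetical names. Whether it is actually faster depends on the shape of the data, so it is worth timing on your own frame.

```python
import numpy as np
import pandas as pd


def top_n_columns_partitioned(scores: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Top-n column labels per row via partial sorting (hypothetical helper)."""
    values = scores.to_numpy()
    labels = scores.columns.to_numpy()
    if not 1 <= n <= values.shape[1]:
        raise ValueError("n must be between 1 and the number of columns")
    # Negate so that the n smallest entries of `neg` are the n largest scores.
    neg = -values
    # After partitioning, the first n slots of each row hold the positions of
    # the n smallest entries of `neg`, in no particular order.
    part = np.argpartition(neg, n - 1, axis=1)[:, :n]
    # Sort just those n candidates per row so the largest score comes first.
    order = np.argsort(np.take_along_axis(neg, part, axis=1), axis=1)
    top = np.take_along_axis(part, order, axis=1)
    return pd.DataFrame(
        labels[top],
        index=scores.index,
        columns=[f"Top_{i + 1}" for i in range(n)],
    )


# Example data copied from the question.
example = pd.DataFrame(
    {
        "Metric1": [0.5, 0.1, 0.3, 0.9],
        "Metric2": [0.3, 0.8, 0.1, 0.4],
        "Metric3": [0.2, 0.5, 0.7, 0.8],
        "Metric4": [0.8, 0.2, 0.4, 0.5],
        "Metric5": [0.7, 0.4, 0.2, 0.2],
    },
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

result_fast = top_n_columns_partitioned(example, n=3)
print(result_fast)
```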

162

answered Oct 30 '22 15:10

jezrael

A different way using Pandas reshaping:

df.set_index('ID', inplace=True)
df_out = (df.rank(axis=1, ascending=False).astype(int)
            .reset_index().melt(id_vars='ID')
            .query('value <= 3').pivot(index='ID', columns='value'))
df_out.columns = df_out.columns.droplevel().astype(str)
df_out = df_out.add_prefix('Top_')
print(df_out)

Output:

value    Top_1    Top_2    Top_3
ID                              
1      Metric4  Metric5  Metric1
2      Metric2  Metric3  Metric5
3      Metric3  Metric4  Metric1
4      Metric1  Metric3  Metric4
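
One thing to watch with the rank-based approach is ties. By default `rank` averages tied positions (two values tied for first both get 1.5), so after `astype(int)` two metrics in the same row can end up with the same rank, and the pivot can then fail on duplicate entries. A small sketch of one way around this, assuming an arbitrary but consistent tie-break is acceptable: pass `method='first'` so every rank in a row is distinct. The data and the names `scores`, `ranks`, `long_form` and `top` are made up for illustration.

```python
import pandas as pd

# Made-up data: row 1 has a tie for the largest value.
scores = pd.DataFrame(
    {"Metric1": [0.5, 0.1], "Metric2": [0.5, 0.8], "Metric3": [0.2, 0.5]},
    index=pd.Index([1, 2], name="ID"),
)

# method="first" breaks ties instead of averaging them, so each row gets
# the distinct ranks 1, 2, 3.
ranks = scores.rank(axis=1, ascending=False, method="first").astype(int)

# Long format: one row per (ID, metric) pair with its rank in "value".
long_form = ranks.reset_index().melt(id_vars="ID")

# Keep the top 2 ranks and spread them back out into columns.
top = long_form[long_form["value"] <= 2].pivot(
    index="ID", columns="value", values="variable"
)
top.columns = [f"Top_{rank}" for rank in top.columns]
print(top)
```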

answered Oct 30 '22 15:10

Scott Boston

Tags: python, sorting, pandas, rank, top-n
