
How to find column-index of top-n values within each row of huge dataframe

I have a dataframe in the following format (example data):

      Metric1  Metric2  Metric3  Metric4  Metric5
ID    
1     0.5      0.3      0.2      0.8      0.7    
2     0.1      0.8      0.5      0.2      0.4    
3     0.3      0.1      0.7      0.4      0.2    
4     0.9      0.4      0.8      0.5      0.2    

where the scores range between [0,1]. I want to write a function that, for each ID (row), finds the top n metrics, where n is an input of the function along with the original dataframe.

My ideal output would be (e.g. for n = 3):

      Top_1     Top_2     Top_3
ID    
1     Metric4   Metric5   Metric1    
2     Metric2   Metric3   Metric5    
3     Metric3   Metric4   Metric1    
4     Metric1   Metric3   Metric4  

Now I have written a function that does work:

import numpy as np
import pandas as pd

def top_n_partners(scores, top_n=3):
    metrics = np.array(scores.columns)
    records = []
    # to_records() yields (index, value1, value2, ...) for each row
    for rec in scores.to_records():
        rec = list(rec)
        ID = rec[0]
        score_vals = rec[1:]
        inds = np.argsort(score_vals)
        # reverse the ascending order so the best metric comes first
        top_metrics = metrics[inds][::-1]
        dic = {
            'top_score_%s' % (i+1): top_metrics[i]
            for i in range(top_n)
        }
        dic['ID'] = ID
        records.append(dic)
    top_n_df = pd.DataFrame(records)
    top_n_df.set_index('ID', inplace=True)
    return top_n_df

However, it seems rather slow, especially for the volume of data I'd be running this over (a dataframe with millions of rows). Is there a smarter way to go about this?

tfcoe asked Jun 01 '17 12:06



2 Answers

You can use numpy.argsort:

print (np.argsort(-df.values, axis=1)[:,:3])
[[3 4 0]
 [1 2 4]
 [2 3 0]
 [0 2 3]]

# .values gives a plain NumPy array of labels; recent pandas versions
# do not allow indexing an Index with a 2D array
print (df.columns.values[np.argsort(-df.values, axis=1)[:,:3]])

[['Metric4' 'Metric5' 'Metric1']
 ['Metric2' 'Metric3' 'Metric5']
 ['Metric3' 'Metric4' 'Metric1']
 ['Metric1' 'Metric3' 'Metric4']]

df = pd.DataFrame(df.columns.values[np.argsort(-df.values, axis=1)[:,:3]],
                  index=df.index)
df = df.rename(columns = lambda x: 'Top_{}'.format(x + 1))
print (df)
      Top_1    Top_2    Top_3
ID                           
1   Metric4  Metric5  Metric1
2   Metric2  Metric3  Metric5
3   Metric3  Metric4  Metric1
4   Metric1  Metric3  Metric4 
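
Since the question asks for a function with n as a parameter, the steps above can be wrapped up roughly as follows. This is only a sketch: the name `top_n_columns` is made up for illustration, and the example frame is rebuilt from the data in the question so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def top_n_columns(scores, n=3):
    """Return the labels of the n largest values in each row, best first."""
    # Negate the scores so an ascending argsort gives descending score order,
    # then keep the first n column positions of every row.
    order = np.argsort(-scores.to_numpy(), axis=1)[:, :n]
    # Index the label array (not the Index object) with the 2D position array.
    labels = scores.columns.to_numpy()[order]
    return pd.DataFrame(
        labels,
        index=scores.index,
        columns=["Top_{}".format(i + 1) for i in range(n)],
    )


# Example data copied from the question.
scores = pd.DataFrame(
    {
        "Metric1": [0.5, 0.1, 0.3, 0.9],
        "Metric2": [0.3, 0.8, 0.1, 0.4],
        "Metric3": [0.2, 0.5, 0.7, 0.8],
        "Metric4": [0.8, 0.2, 0.4, 0.5],
        "Metric5": [0.7, 0.4, 0.2, 0.2],
    },
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

result = top_n_columns(scores, n=3)
print(result)
```

Unlike the snippets above, this does not overwrite `df`, so it can be called repeatedly with different values of n.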

Thank you Divakar for improving:

n = 3
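# take the last n positions of the ascending argsort, in reverse order;
# [:, :-n-1:-1] works for any n, not only n = 3
df = pd.DataFrame(df.columns.values[df.values.argsort(1)[:, :-n-1:-1]],
                  index=df.index)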

df = df.rename(columns = lambda x: 'Top_{}'.format(x + 1))
print (df)
      Top_1    Top_2    Top_3
ID                           
1   Metric4  Metric5  Metric1
2   Metric2  Metric3  Metric5
3   Metric3  Metric4  Metric1
4   Metric1  Metric3  Metric4                
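
With millions of rows and a small n, fully sorting every row does more work than strictly needed. One possible variation, which is not part of the answer above, is to use `np.argpartition` to pick out the n largest positions per row and then sort only those n. The sketch below assumes n is no larger than the number of columns; the function name `top_n_columns_partition` is hypothetical, and whether it is actually faster depends on the number of columns, so it is worth timing on real data.

```python
import numpy as np
import pandas as pd


def top_n_columns_partition(scores, n=3):
    """Labels of the n largest values per row, best first, via argpartition."""
    values = scores.to_numpy()
    # After partitioning around position -n, the last n slots of each row
    # hold the positions of the n largest values, in no particular order.
    candidates = np.argpartition(values, -n, axis=1)[:, -n:]
    # Sort just those n candidates by descending score.
    candidate_vals = np.take_along_axis(values, candidates, axis=1)
    order = np.argsort(-candidate_vals, axis=1)
    top = np.take_along_axis(candidates, order, axis=1)
    labels = scores.columns.to_numpy()[top]
    return pd.DataFrame(
        labels,
        index=scores.index,
        columns=["Top_{}".format(i + 1) for i in range(n)],
    )


# Random test frame (made-up data, only to exercise the function).
rng = np.random.default_rng(0)
big = pd.DataFrame(
    rng.random((1000, 8)),
    columns=["Metric{}".format(i + 1) for i in range(8)],
)
fast = top_n_columns_partition(big, n=3)
print(fast.head())
```

When scores are tied within a row, the order among the tied metrics may differ from the full-argsort version.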
jezrael answered Oct 30 '22 15:10


A different way using Pandas reshaping:

df.set_index('ID', inplace=True)
df_out = (df.rank(axis=1, ascending=False).astype(int)
            .reset_index().melt(id_vars='ID')
            .query('value <= 3').pivot(index='ID', columns='value'))
df_out.columns = df_out.columns.droplevel().astype(str)
df_out = df_out.add_prefix('Top_')
print(df_out)

Output:

value    Top_1    Top_2    Top_3
ID                              
1      Metric4  Metric5  Metric1
2      Metric2  Metric3  Metric5
3      Metric3  Metric4  Metric1
4      Metric1  Metric3  Metric4
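
One thing to watch with the rank-based approach is tied scores. With the default `method='average'`, two values tied for first place both get rank 1.5, `astype(int)` turns both into 1, and the pivot then has duplicate entries to place. A possible workaround, sketched here as an assumption rather than as part of the answer, is `method='first'`, which gives every value in a row a distinct rank. The small frame below is made-up data with a deliberate tie, and the names `position` and `metric` are chosen just for this example.

```python
import pandas as pd

# Made-up data: ID 1 has a tie between Metric1 and Metric2.
df = pd.DataFrame(
    {"Metric1": [0.5, 0.1], "Metric2": [0.5, 0.8], "Metric3": [0.2, 0.5]},
    index=pd.Index([1, 2], name="ID"),
)
n = 2

# method='first' breaks ties so each row gets the distinct ranks 1..k.
ranks = df.rank(axis=1, ascending=False, method="first").astype(int)

# Long format: one row per (ID, metric) pair with its rank position.
long = ranks.reset_index().melt(
    id_vars="ID", var_name="metric", value_name="position"
)

# Keep the top n positions and spread them back out into columns.
top = long[long["position"] <= n].pivot(
    index="ID", columns="position", values="metric"
)
top.columns = ["Top_{}".format(p) for p in top.columns]
print(top)
```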
Scott Boston answered Oct 30 '22 15:10