I have a dataframe in the following format (example data):
Metric1 Metric2 Metric3 Metric4 Metric5
ID
1 0.5 0.3 0.2 0.8 0.7
2 0.1 0.8 0.5 0.2 0.4
3 0.3 0.1 0.7 0.4 0.2
4 0.9 0.4 0.8 0.5 0.2
where the scores lie in the range [0, 1]. I want to write a function that, for each ID (row), finds the top n metrics, where n is an argument of the function along with the original dataframe.
My ideal output would be (e.g. for n = 3):
Top_1 Top_2 Top_3
ID
1 Metric4 Metric5 Metric1
2 Metric2 Metric3 Metric5
3 Metric3 Metric4 Metric1
4 Metric1 Metric3 Metric4
Now I have written a function that does work:
import numpy as np
import pandas as pd

def top_n_partners(scores, top_n=3):
    metrics = np.array(scores.columns)
    records = []
    for rec in scores.to_records():
        rec = list(rec)
        ID = rec[0]
        score_vals = rec[1:]
        # argsort is ascending, so reverse to put the highest scores first
        inds = np.argsort(score_vals)
        top_metrics = metrics[inds][::-1]
        dic = {
            'top_score_%s' % (i+1): top_metrics[i]
            for i in range(top_n)
        }
        dic['ID'] = ID
        records.append(dic)
    top_n_df = pd.DataFrame(records)
    top_n_df.set_index('ID', inplace=True)
    return top_n_df
However, it seems rather inefficient and slow, especially for the volume of data I'd be running it over (a dataframe with millions of rows). Is there a smarter way to go about this?
You can use numpy.argsort:
print (np.argsort(-df.values, axis=1)[:,:3])
[[3 4 0]
[1 2 4]
[2 3 0]
[0 2 3]]
print (df.columns[np.argsort(-df.values, axis=1)[:,:3]])
Index([['Metric4', 'Metric5', 'Metric1'], ['Metric2', 'Metric3', 'Metric5'],
['Metric3', 'Metric4', 'Metric1'], ['Metric1', 'Metric3', 'Metric4']],
dtype='object')
df = pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)[:,:3]],
index=df.index)
df = df.rename(columns = lambda x: 'Top_{}'.format(x + 1))
print (df)
Top_1 Top_2 Top_3
ID
1 Metric4 Metric5 Metric1
2 Metric2 Metric3 Metric5
3 Metric3 Metric4 Metric1
4 Metric1 Metric3 Metric4
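Since the question asks for a function that takes n as an argument, the argsort approach above could be wrapped up roughly as follows. This is only a sketch: the name top_n_metrics and the rebuilt sample frame are illustrative, and it assumes every column is numeric and that n is no larger than the number of columns.

```python
import numpy as np
import pandas as pd


def top_n_metrics(scores, n=3):
    """Return, for each row, the names of the n highest-scoring columns."""
    # Negate so that an ascending argsort yields descending score order.
    order = np.argsort(-scores.to_numpy(), axis=1)[:, :n]
    # Index the column labels with the (rows, n) array of positions.
    names = scores.columns.to_numpy()[order]
    return pd.DataFrame(
        names,
        index=scores.index,
        columns=[f"Top_{i + 1}" for i in range(n)],
    )


# Sample data from the question, with ID as the index.
scores = pd.DataFrame(
    {
        "Metric1": [0.5, 0.1, 0.3, 0.9],
        "Metric2": [0.3, 0.8, 0.1, 0.4],
        "Metric3": [0.2, 0.5, 0.7, 0.8],
        "Metric4": [0.8, 0.2, 0.4, 0.5],
        "Metric5": [0.7, 0.4, 0.2, 0.2],
    },
    index=pd.Index([1, 2, 3, 4], name="ID"),
)
result = top_n_metrics(scores, n=3)
print(result)
```

Because the whole computation is a single vectorised argsort over the underlying array, there is no Python-level loop over rows, which is where the original function spends most of its time.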
Thank you Divakar for improving:
n = 3
# take the last n positions of the ascending argsort, highest score first
df = pd.DataFrame(df.columns[df.values.argsort(1)[:, :-n-1:-1]],
index=df.index)
df = df.rename(columns = lambda x: 'Top_{}'.format(x + 1))
print (df)
Top_1 Top_2 Top_3
ID
1 Metric4 Metric5 Metric1
2 Metric2 Metric3 Metric5
3 Metric3 Metric4 Metric1
4 Metric1 Metric3 Metric4
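If there are many metric columns and n is small, fully sorting every row does more work than necessary. One possible refinement, which is not part of the answer above, is np.argpartition: it only separates out the n largest entries per row, so just those n need sorting afterwards. The helper name top_n_partition and the random test frame are hypothetical, and the sketch assumes 1 <= n <= number of columns.

```python
import numpy as np
import pandas as pd


def top_n_partition(scores, n=3):
    """Top-n column names per row using a partial sort (argpartition)."""
    values = scores.to_numpy()
    # Positions of the n largest values per row, in arbitrary order.
    part = np.argpartition(-values, n - 1, axis=1)[:, :n]
    # Sort only those n candidates per row, highest score first.
    part_vals = np.take_along_axis(values, part, axis=1)
    order = np.argsort(-part_vals, axis=1)
    top = np.take_along_axis(part, order, axis=1)
    return pd.DataFrame(
        scores.columns.to_numpy()[top],
        index=scores.index,
        columns=[f"Top_{i + 1}" for i in range(n)],
    )


# Made-up data: 1000 rows of 20 random scores in [0, 1).
rng = np.random.default_rng(0)
wide = pd.DataFrame(
    rng.random((1000, 20)),
    columns=[f"Metric{i + 1}" for i in range(20)],
)
top4 = top_n_partition(wide, n=4)
print(top4.head())
```

With only five columns, as in the question, the difference from a plain argsort is likely to be negligible; the partial sort mainly pays off as the number of columns grows.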
A different way using Pandas reshaping:
df.set_index('ID', inplace=True)
df_out = (df.rank(axis=1, ascending=False)
            .astype(int)
            .reset_index()
            .melt(id_vars='ID')
            .query('value <= 3')
            .pivot(index='ID', columns='value'))
df_out.columns = df_out.columns.droplevel().astype(str)
df_out = df_out.add_prefix('Top_')
print(df_out)
Output:
value Top_1 Top_2 Top_3
ID
1 Metric4 Metric5 Metric1
2 Metric2 Metric3 Metric5
3 Metric3 Metric4 Metric1
4 Metric1 Metric3 Metric4
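One caveat with the rank-based version: by default, rank averages tied scores (two metrics tied for first place both get 1.5), so after astype(int) the pivot can hit duplicate entries. A hedged variant passes method='first' so that every rank within a row is unique. The frame below is made-up data with a deliberate tie, and the names tied, ranks, long_form and top are illustrative only.

```python
import pandas as pd

n = 2
# Made-up scores: Metric1 and Metric2 tie for first place in row 1.
tied = pd.DataFrame(
    {"Metric1": [0.5, 0.1], "Metric2": [0.5, 0.8], "Metric3": [0.2, 0.3]},
    index=pd.Index([1, 2], name="ID"),
)

# method="first" breaks ties so each row gets the distinct ranks 1..k.
ranks = tied.rank(axis=1, ascending=False, method="first").astype(int)

# Long format: one row per (ID, metric) pair with its rank.
long_form = ranks.reset_index().melt(
    id_vars="ID", var_name="metric", value_name="rank"
)

# Keep the top n ranks and spread them back out into columns.
top = long_form[long_form["rank"] <= n].pivot(
    index="ID", columns="rank", values="metric"
)
top.columns = [f"Top_{r}" for r in top.columns]
print(top)
```

Which of two tied metrics comes first is then decided by pandas' tie-breaking rule rather than by the data, so it is worth checking that this is acceptable for your use case.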