I've got a very large pandas df that has the fields group, id_a, id_b and score, ordered by score so the highest is at the top. There is a row for every possible combination of id_a and id_b. I want to extract rows so that there is only one row per id_a and id_b, which reflects the highest score possible without repeating IDs.
Here's an example of what this might look like. The resulting df has 3 rows, with every ID in id_a and id_b appearing exactly once. For A2/B2 and A1/B1, the row with the best score for both IDs is used. For A3, the best score was with B1, which had already been used, so the next best score, with B3, is used.
Input table

group  id_a  id_b  score
1      A2    B2    0.99
1      A1    B1    0.98
1      A3    B1    0.97
1      A3    B3    0.96
1      A2    B1    0.93
1      A1    B2    0.5
1      A1    B3    0.41
1      A2    B3    0.4
1      A3    B2    0.2

Desired result

group  id_a  id_b  score
1      A2    B2    0.99
1      A1    B1    0.98
1      A3    B3    0.96
To achieve this, I've got a loop iterating through the original df, which is incredibly slow with a large dataset. I've tried to come up with alternatives but am struggling. Can anybody offer any other approaches? Thank you!
import pandas as pd

# Create sample df
group = [1, 1, 1, 1, 1, 1, 1, 1, 1]
id_a = ['A2', 'A1', 'A3', 'A3', 'A2', 'A1', 'A1', 'A2', 'A3']
id_b = ['B2', 'B1', 'B1', 'B3', 'B1', 'B2', 'B3', 'B3', 'B2']
score = [0.99, 0.98, 0.97, 0.96, 0.93, 0.5, 0.41, 0.4, 0.2]
df = pd.DataFrame({'group': group, 'id_a': id_a, 'id_b': id_b, 'score': score})

result = pd.DataFrame(columns=df.columns)

# Extract required rows
for i, row in df.iterrows():
    if len(result) == 0:
        result = row.to_frame().T
    else:
        if ((row['id_a'] in result['id_a'].tolist())
                or (row['id_b'] in result['id_b'].tolist())):
            continue
        else:
            result = pd.concat([result, row.to_frame().T[result.columns]])
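As an aside, the same greedy pass can be written much more cheaply (a sketch of my own, not the code above): track the used IDs in sets and collect row labels, then take them all at once with .loc, instead of growing a DataFrame row by row. This keeps the loop's semantics exactly; like the original loop, it is not guaranteed to maximize the total score.

```python
import pandas as pd

group = [1, 1, 1, 1, 1, 1, 1, 1, 1]
id_a = ['A2', 'A1', 'A3', 'A3', 'A2', 'A1', 'A1', 'A2', 'A3']
id_b = ['B2', 'B1', 'B1', 'B3', 'B1', 'B2', 'B3', 'B3', 'B2']
score = [0.99, 0.98, 0.97, 0.96, 0.93, 0.5, 0.41, 0.4, 0.2]
df = pd.DataFrame({'group': group, 'id_a': id_a, 'id_b': id_b, 'score': score})

seen_a, seen_b, keep = set(), set(), []
for row in df.itertuples():  # itertuples is far cheaper than iterrows
    if row.id_a in seen_a or row.id_b in seen_b:
        continue
    seen_a.add(row.id_a)
    seen_b.add(row.id_b)
    keep.append(row.Index)
result = df.loc[keep]
```

On the sample data this returns the same three rows as the desired result.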
This looks to me like a linear_sum_assignment problem (i.e., maximizing the sum of score over unique pairs of id_a/id_b). You could use a pivot per group to do this:
from scipy.optimize import linear_sum_assignment

out = {}
for group, g in (df.pivot(index=['group', 'id_a'],
                          columns='id_b', values='score')
                   .groupby(level=0)):
    idx, col = linear_sum_assignment(g, maximize=True)
    out[group] = pd.DataFrame({'id_a': g.index.get_level_values(1)[idx],
                               'id_b': g.columns[col],
                               'score': g.values[idx, col]})
out = pd.concat(out, names=['group']).reset_index(0)
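To make the intermediate concrete: the pivot builds one square score matrix per group (rows indexed by id_a, columns by id_b), which is exactly the cost matrix that linear_sum_assignment consumes. On the sample data it looks like this (a sketch):

```python
import pandas as pd

group = [1, 1, 1, 1, 1, 1, 1, 1, 1]
id_a = ['A2', 'A1', 'A3', 'A3', 'A2', 'A1', 'A1', 'A2', 'A3']
id_b = ['B2', 'B1', 'B1', 'B3', 'B1', 'B2', 'B3', 'B3', 'B2']
score = [0.99, 0.98, 0.97, 0.96, 0.93, 0.5, 0.41, 0.4, 0.2]
df = pd.DataFrame({'group': group, 'id_a': id_a, 'id_b': id_b, 'score': score})

# Rows: (group, id_a); columns: id_b; each group contributes a square block
pivot = df.pivot(index=['group', 'id_a'], columns='id_b', values='score')
print(pivot)
#               id_b    B1    B2    B3
# group id_a
# 1     A1            0.98  0.50  0.41
#       A2            0.93  0.99  0.40
#       A3            0.97  0.20  0.96
```

The assignment then picks one cell per row and per column so that the total is maximal.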
Alternative code that does the same thing with the steps in a different order. This way is preferred if you don't have the same id_a/id_b values across groups, since it avoids building one large pivoted intermediate:
from scipy.optimize import linear_sum_assignment
import numpy as np

out = []
for group, g in df.sort_values(by=['id_a', 'id_b']).groupby('group'):
    tmp = g.pivot(index='id_a', columns='id_b', values='score')
    idx, col = linear_sum_assignment(tmp, maximize=True)
    out.append(g.iloc[np.ravel_multi_index((idx, col), tmp.shape)])
out = pd.concat(out)
Output:
group id_a id_b score
0 1 A1 B1 0.98
1 1 A2 B2 0.99
2 1 A3 B3 0.96
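As a sanity check (my addition, not part of the answer): for the 3×3 sample, brute-forcing all 3! possible assignments with itertools.permutations confirms that A1-B1, A2-B2, A3-B3 is the maximum-total-score matching, with a total of 2.93 (0.98 + 0.99 + 0.96), matching the output above.

```python
from itertools import permutations
import pandas as pd

group = [1, 1, 1, 1, 1, 1, 1, 1, 1]
id_a = ['A2', 'A1', 'A3', 'A3', 'A2', 'A1', 'A1', 'A2', 'A3']
id_b = ['B2', 'B1', 'B1', 'B3', 'B1', 'B2', 'B3', 'B3', 'B2']
score = [0.99, 0.98, 0.97, 0.96, 0.93, 0.5, 0.41, 0.4, 0.2]
df = pd.DataFrame({'group': group, 'id_a': id_a, 'id_b': id_b, 'score': score})

# Square score matrix: rows A1..A3, columns B1..B3
pivot = df.pivot(index='id_a', columns='id_b', values='score')
n = len(pivot)

# Try every one-to-one assignment of id_a rows to id_b columns
best = max(permutations(range(n)),
           key=lambda perm: sum(pivot.values[i, j] for i, j in enumerate(perm)))
best_total = sum(pivot.values[i, j] for i, j in enumerate(best))
```

This is only feasible for tiny matrices (n! assignments); linear_sum_assignment solves the same problem in polynomial time.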