Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas matching algorithm

Tags:

python

pandas

Say, I have a dataframe like this one:

                     affinity
applicant_id job_id          
1            a              7
             b              7
             c              5
2            a              0
             b              4
             c              2
3            a              4
             b              8
             c              1

I need to match each applicant to a job so that (a) higher affinity is preferred; (b) no applicant is matched to more than one job; (c) no job is matched to more than one applicant. So in the example above I would like to get

                     affinity
applicant_id job_id          
3            b              8
1            a              7
2            c              2

The best I can think of is

tmp = candidates.sort_values(ascending=False).copy()
matches = []
while len(tmp):
    (applicant, job), affinity = next(tmp.iteritems())
    matches.append((applicant, job))
    tmp = tmp.loc[(tmp.index.get_level_values('applicant_id') != applicant)
                  & (tmp.index.get_level_values('job_id') != job)]
candidates.reindex(matches)

Can this be achieved in pandas without explicit iteration?

like image 387
BlindDriver Avatar asked Apr 02 '26 16:04

BlindDriver


1 Answers

This is the typical linear sum assignment problem.

We'll make matrix, filling the missing values with some absurdly high penalty, so that they shouldn't ever get matched. A job will only appear in this matrix if at least one worker has an affinity for it, so this will work.

Sample Data

from scipy import optimize
import pandas as pd

df = pd.DataFrame({'applicant_id': [1]*3 + [2]*3 + [3]*3 + [4],
                   'job_id': ['a', 'b', 'c']*3 + ['h'],
                   'affinity': [7,7,5,0,4,2,4,8,1,10]})

Code

df1 = df.pivot(index='applicant_id', columns='job_id', values='affinity').fillna(-10**8)
#job_id                  a            b            c            h
#applicant_id                                                    
#1                     7.0          7.0          5.0 -100000000.0
#2                     0.0          4.0          2.0 -100000000.0
#3                     4.0          8.0          1.0 -100000000.0
#4            -100000000.0 -100000000.0 -100000000.0         10.0

opt = optimize.linear_sum_assignment(df1.to_numpy()*-1)
pd.DataFrame(df1.lookup(df1.index[opt[0]], df1.columns[opt[1]]), 
             columns=['affinity'],
             index=pd.MultiIndex.from_arrays([df1.index[opt[0]], df1.columns[opt[1]]]))

Output:

                     affinity
applicant_id job_id          
1            a            7.0
2            c            2.0
3            b            8.0
4            h           10.0

With more jobs than people we assign everyone, but some jobs remain unfilled. With more people than jobs, some low-affinity people remain unassigned.

like image 87
ALollz Avatar answered Apr 04 '26 07:04

ALollz



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!