How to find set of lowest sum of distinct column elements in python?

Question

Problem statement: There are 5 projects and 15 employees and the numbers against each column shows the interest of each employee in a given project. Each project can have a maximum of 3 employees. Scores are from 1-5 1 being the highest preference and 5 being the lowest preference. I have to divide the employees among the projects in such a way that least number of people get dissatisfied or minimum score. Note that my algorithm creates all possible combinations and then sorts those combinations with sum in ascending order and picks up top 5 combinations with distinct employees.

But here is the problem, so, for example, my sortedsum matrix is [1,1,1,4,9,9,...] now this algorithmically is correct, but the thing is if I pick top 5 of them my total sum will be 16. But there could be a possibility where instead of taking [1,1,1,4,9] if I take [2,1,1,4] as the first four then the fifth project team sum goes to 3 and in that way the minimum will change and here's the point where my algorithm fails.

I have a 3nXn matrix, for this example, I will take it as a 15x5: So the matrix looks like this (https://i.stack.imgur.com/omcq7.png):

df = pd.read_csv(io.StringIO("""

employee  proj_A  proj_B  proj_C  proj_D  proj_E
      A1       1       5       3       4       2
      B1       5       4       1       2       3
      C1       2       3       4       1       5
      A2       4       2       1       3       5
      B2       4       5       3       2       1
      C2       3       1       2       5       4
      A3       1       2       4       3       5
      B3       2       3       1       5       4
      C3       5       3       4       1       2
      A4       4       5       3       2       1
      B4       5       3       4       2       1
      C4       1       2       3       4       5
      A5       1       3       2       5       4
      B5       2       1       3       5       4
      C5       2       1       4       5       4

      """), sep=r"\s+")

[formatted to make it easy to paste into the shell]

The problem I want to solve is to pick three distinct elements in each column, distinct meaning from distinct rows in such a way that their sum taken together for all 5 column remains the least.

For example, here if I pick A1,B1,C1 for A and then A2,B2,C2 for B and so on then the sum which is 1+5+2=8 for A and 2+5+1=8 for B and so on when added together i.e. 8+8+... should be the minimum sum of all possible combinations. Note that if A1, B1 and C1 are assigned to A they can't switch to B or any other next column.

What I tried was created all possible combinations starting from A1, B1, C1 to A5, B5 and C5 and computed their sums and sorted it in increasing order and picked first five which had distinct elements as shown below:

Limitations with my code: 1. It takes so much time for the matrix I am optimizing (which is a 30x10 matrix) as the combinations are way too much. 2. It will ignore any scenario where by compromising the scores for the initial elements to a littlebit higher, we may get the mid-scores which can be reduced a lot.

import pandas as pd
data=pd.read_csv("csvfile.csv")
teamsize=3
employes=data["Name"]
PScore=[]
for i in range(10):
    PScore.append(data[f"Project {i+1}"])
Scorings_combo=[]
for i in range(len(employes)):
    for j in range(len(employes)):
        for k in range(len(employes)):
            for l in range(10):
                if i==j or j==k or k==i:
                    break
                score=0
                score=score+PScore[l][i]+PScore[l][j]+PScore[l][k]
                Scorings_combo.append([i+1,j+1,k+1,l+1,score])
a=[Scorings_combo[i][4]for i in range(len(Scorings_combo))]
#b=sorted(a,reverse=True)
b=sorted(a)
emps=[]
sig=1
empl=[]
passigned=[]
countee=0
for i in range(len(b)):
    for j in range(3):
        if Scorings_combo[a.index(b[i])][j] in emps or Scorings_combo[a.index(b[i])][3] in passigned:
            a[a.index(b[i])]=-1
            sig=0
            break
    if sig!=0:
        print("New")
        for k in range(3):emps.append(Scorings_combo[a.index(b[i])][k])
        empl.append(Scorings_combo[a.index(b[i])])
        passigned.append(Scorings_combo[a.index(b[i])][3])
        countee=countee+1
        if count==8:
            break
    sig=1
    print(f"Iteration:{i}/{len(b)}")

For example: 3,3,3,4,9 will be the solution even if the following is possible: 4,4,4,4,4 because it will look for the descending order distinct elements which gives me the first solution.

Kindly do help me if you have any ideas. Thanks

Here is the drive link to the data: https://drive.google.com/file/d/1yaswBEi3RzrhQ743hJTnUeZFZNo-QBBR/view?usp=sharing

Here's a simpler example: Matrix=[[1,2],[2,1],[1,2],[1,2],[2,1],[1,2]] Now minimum possible combination is: for the first column: [1,1,1] and [1,1,2] for the second column.

Todd · Accepted Answer

The question describes a variation of the "stable marriage problem". Which has a pretty simple algorithm; I pasted in the simplest version below - there are other versions that are more stringent. Suitors who aren't engaged, propose to the other party, which either rejects or accepts (removing their current engagement). This goes on until all parties are engaged.

# Gale-Shapley algorithm:
algorithm stable_matching is
    Initialize all m ∈ M and w ∈ W to free
    while ∃ free man m who still has a woman w to propose to do
        w := first woman on m`s list to whom m has not yet proposed
        if w is free then
            (m, w) become engaged
        else some pair (m`, w) already exists
            if w prefers m to m` then
                m` becomes free
                (m, w) become engaged 
            else
                (m`, w) remain engaged
            end if
        end if
    repeat

Modelling a project distribution solution after the stable marriage problem, we can treat the Teammates as the suitors, and the other party as the polygamous Project, which can be engaged to more than one suitor. This may seem like a lot of code, but it's relatively short and easy to follow.

If the above algorithm is followed exactly, the project distributions may contain the 3rd or 4th picks of some of the teammates - which isn't a desirable situation. So, I found that randomizing the order of assignments and retrying several times while keeping track of the best distribution produced the best results.

Each time this is run, it will likely produce an equivalently scored project distribution, but with teammates arranged differently. So if a teammate decides they don't like the 1 or 2 pick for some reason, just generate a new distribution.

Runnable code at: https://pastebin.com/kVj0FuJP

import io
import pandas as pd
from copy import deepcopy
from itertools import cycle
from random import randint, shuffle
from statistics import mean, stdev

class Teammate:
    def __init__(self, identifier, projects, rankings):
        self._id        = identifier
        self._projects  = list(projects)
        self._rankings  = dict(zip(projects, rankings))
        self._prefs     = self.gen_project_preferences_cycle()
        self._project   = None

    @property
    def id(self):
        return self._id

    def rank(self, project):
        return self._rankings[project]

    @property
    def state(self):
        return 'ENGAGED' if self._project else 'FREE'

    def propose(self, project):
        return project.consider(self)

    def engage(self, project):
        self._project = project
        project.add(self)

    def disengage(self, project):
        self._project = None
        project.remove(self)

    @property
    def project_preferences(self):
        return self._prefs

    def gen_project_preferences_cycle(self):
        """ Returns a generator for a cyclical list of preferences in 
        order.
        """
        prefs = sorted(self._projects, key=self.rank)
        if randint(0, 1):
            prefs.insert(0, prefs[1])
        return cycle(prefs)

    def reset(self):
        self._project = None
        self._prefs   = self.gen_project_preferences_cycle()

By randomly assigning the second pick to the front of some Teammate's preference lists, we get an interesting effect on the placements. By starting some teammates off on their 2nd pick, it's likely they're going to get bumped out of their 2nd choice and then find another project they rated 1. The net effect is you get matrices with mostly 1's, a small fraction of 2's, and no 3's and 4's.

The more interesting method of the Project class is the consider() method where the Project considers the engagement proposal of a Teammate. The fickle Project will cast aside any engaged suitor in its set of current suitors, if another comes along with a better offer.

class Project:
    def __init__(self, identifier, seats):
        self._id    = identifier
        self._seats = seats
        self._team  = []

    @property
    def id(self):
        return self._id

    @property
    def score(self):
        return sum(tm.rank(self) for tm in self._team)

    def consider(self, teammate):
        """ Consider a proposal, have the Teammate establish the engagement
        if accepted, then return True; otherwise return False.
        """
        rank    = teammate.rank(self)
        team    = self._team
        seats   = self._seats
        if len(team) < seats or any(rank < tm.rank(self) for tm in team):
            # Either there's a seat available, or the newcomer has a better
            # offer than one of the current suitors.
            if len(team) >= seats:
                shuffle(team)
                tm_out = max(team, key=lambda tm: tm.rank(self))
                tm_out.disengage(self)  # Hit the road Jack...
            teammate.engage(self)  # What? and no ring!?...
            return True
        return False

    def add(self, teammate):
        self._team.append(teammate)

    def remove(self, teammate):
        self._team.remove(teammate)

    def clear(self):
        self._team = []

    @property
    def state(self):
        return 'FREE' if len(self._team) < self._seats else 'ENGAGED'

    @property
    def roster(self):
        self._team.sort(key=lambda tm: tm.id)
        return f"{self.id}: {[(tm.id, tm.rank(self)) for tm in self._team]}"

Of course, in the modern world everyone is using matchmaking services, and so do the Teammates and Projects... The service is a bit quirky though and relies to some extent on random tries until it finds the best matchings for all parties.

class ProjectMatcher:
    def __init__(self, projects, teammates):
        self._projects  = projects
        self._teammates = teammates

    def match(self, num_iterations=1000):
        best_score   = 99999
        best_distrib = None

        for _ in range(num_iterations):
            # This is the piece that does the actual matching. You can 
            # see each line corresponds to a step in the Gale-Shapely
            # algorithm. You can find the details of some steps within
            # the Teammate and Project methods.               

            for teammate in self.gen_free_teammates_cycle():
                for project in teammate.project_preferences:
                    accepted = teammate.propose(project)
                    if accepted:
                        break

            # Determine if this distribution is the best so far.
            if self.evaluate() < best_score:
                best_score   = self.evaluate()
                best_distrib = deepcopy(self._projects)

            # Get ready for another round of matchmaking.
            self.reset()

        # Print out the best distribution of assignments found.
        print("PROJECT ASSIGNMENTS:")
        for project in best_distrib:
            print(f"    {project.roster}")

    def evaluate(self):
        # Determine the quality of the project assignments. We want a
        # low average score, preferably without outliers.
        s = [p.score for p in self._projects]
        m = mean(s)
        d = stdev(s, m)
        return m + d

    def gen_free_teammates_cycle(self):
        """ Returns a cyclical generator of the list of Teammates that are
        still unengaged.
        """
        teammates = list(self._teammates)
        shuffle(teammates)
        while any(tm.state == 'FREE' for tm in teammates):
            for tm in teammates:
                if tm.state == 'FREE':
                    yield tm
            shuffle(teammates)

    def reset(self):
        [p.clear() for p in self._projects]
        [t.reset() for t in self._teammates]

    @staticmethod
    def from_dataframe(df):
        n_seats   = len(df.values) // (len(df.columns) - 1)
        projects  = [Project(proj, n_seats) for proj in df.columns[1:]]
        teammates = [Teammate(tm[0], projects, tm[1:]) for tm in df.values]
        return ProjectMatcher(projects, teammates)

And to put it all together...

if __name__ == '__main__':

    df = pd.read_csv(io.StringIO("""
    employee  proj_A  proj_B  proj_C  proj_D  proj_E
          A1       1       5       3       4       2
          B1       5       4       1       2       3
          C1       2       3       4       1       5
          A2       4       2       1       3       5
          B2       4       5       3       2       1
          C2       3       1       2       5       4
          A3       1       2       4       3       5
          B3       2       3       1       5       4
          C3       5       3       4       1       2
          A4       4       5       3       2       1
          B4       5       3       4       2       1
          C4       1       2       3       4       5
          A5       1       3       2       5       4
          B5       2       1       3       5       4
          C5       2       1       4       5       4
      """), sep=r'\s+')

    pm = ProjectMatcher.from_dataframe(df)

    pm.match()

Output:

$ python projectmatcher.py 
PROJECT ASSIGNMENTS:
    proj_A: [('A1', 1), ('A3', 1), ('C4', 1)]
    proj_B: [('B5', 1), ('C2', 1), ('C5', 1)]
    proj_C: [('A2', 1), ('A5', 2), ('B3', 1)]
    proj_D: [('B1', 2), ('C1', 1), ('C3', 1)]
    proj_E: [('A4', 1), ('B2', 1), ('B4', 1)]

The results are consistently good with maybe 100 iterations, it defaults to 1K for good measure, which only takes a blink of the eye. Somehow it manages to produce high quality matchings despite being based partly on chance.

The program can be modified to collect all the unique distributions that have the lowest score and generate a list. Each of these had the same lowest score, but the distributions are different:

PROJECT ASSIGNMENTS:
    option 1
        proj_A: [('A3', 1), ('A5', 1), ('C4', 1)]
        proj_B: [('B5', 1), ('C2', 1), ('C5', 1)]
        proj_C: [('A2', 1), ('B1', 1), ('B3', 1)]
        proj_D: [('B4', 2), ('C1', 1), ('C3', 1)]
        proj_E: [('A1', 2), ('A4', 1), ('B2', 1)]
    option 2
        proj_A: [('A3', 1), ('A5', 1), ('C4', 1)]
        proj_B: [('B5', 1), ('C2', 1), ('C5', 1)]
        proj_C: [('A2', 1), ('B1', 1), ('B3', 1)]
        proj_D: [('A4', 2), ('C1', 1), ('C3', 1)]
        proj_E: [('A1', 2), ('B2', 1), ('B4', 1)]
    option 3
        proj_A: [('A1', 1), ('A3', 1), ('C4', 1)]
        proj_B: [('B5', 1), ('C2', 1), ('C5', 1)]
        proj_C: [('A2', 1), ('A5', 2), ('B3', 1)]
        proj_D: [('B1', 2), ('C1', 1), ('C3', 1)]
        proj_E: [('A4', 1), ('B2', 1), ('B4', 1)]
    option 4
        proj_A: [('A3', 1), ('A5', 1), ('C4', 1)]
        proj_B: [('B5', 1), ('C2', 1), ('C5', 1)]
        proj_C: [('A2', 1), ('B1', 1), ('B3', 1)]
        proj_D: [('B2', 2), ('C1', 1), ('C3', 1)]
        proj_E: [('A1', 2), ('A4', 1), ('B4', 1)]

How to find set of lowest sum of distinct column elements in python?

Tags:

python

python-3.x

pandas

Shivam Sahil

1 Answers

Todd

Recent Activity

Donate For Us

How to find set of lowest sum of distinct column elements in python?

Tags:

python

python-3.x

pandas

Shivam Sahil

1 Answers

Todd

Related questions

Recent Activity

Donate For Us