Linear Programing- Max value optimization

Tags:

I'm trying to find the best possible combination that will maximize my sum value, but it has to be under 2 specific constraints, therefore I am assuming Linear programming will be the best fit.

The problem goes like this: Some educational world-event wish to gather the world's smartest teen students. Every state tested 100K students on the following exams:'MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY','PHYSICS'.. and where graded 0-100 on EACH exam.

Every state was requested to send their best 10K from the tested 100K students for the event.

You, as the French representative, were requested to choose the top 10K students from the tested 100K student from your country. For that, you'll need to optimize the MAX VALUE from them in order to get the best possible TOTAL SCORE.

BUT there are 2 main constrains:

1- from the total 10K chosen students you need to allocate specific students that will be tested on the event on 1 specific subject only from the mentioned 5 subjects. the allocation needed is: ['MATH': 4000, 'ENGLISH':3000,'COMPUTERS':2000, 'HISTORY':750,'PHYSICS':250]

2- Each 'exam subject' score will have to be weighted differently.. for exp: 97 is Math worth more than 97 in History. the wheights are: ['MATH': 1.9, 'ENGLISH':1.7,'COMPUTERS':1.5, 'HISTORY':1.3,'PHYSICS':1.1]

MY SOLUTION: I tried to use the PULP (python) as an LP library and solved it correctly, but it took more than 2 HOURS of running. can you find a better (faster, simpler..) way to solve it? there are some NUMPY LP functions that could be used instead, maybe will be faster? it supposed to be a simple OPTIMIZATION problem be I made it too slow and complexed. --The solution needs to be in Python only please

for example, let's look on a small scale at the same problem: there are 30 students and you need to choose only 15 students that will give us the best combination in relation to the following subject allocation demand. the allocation needed is- ['MATH': 5, 'ENGLISH':4,'COMPUTERS':3, 'HISTORY':2,'PHYSICS':1]

this is all the 30 students and their grades:

enter image description here

after running the algorithm, the output solution will be:

enter image description here

here is my full code for the ORIGINAL question (100K students):

Click to copy

import pandas as pd
import numpy as np
import pulp as p
import time    
t0=time.time()

demand = [4000, 3000, 2000, 750,250]
weight = [1.9,1.7, 1.5, 1.3, 1.1]


original_data= pd.read_csv('GRADE_100K.csv') #created simple csv file with random scores
data_c=original_data.copy()
data_c.index = np.arange(1, len(data_c)+1)
data_c.columns
data_c=data_c[['STUDENT_ID', 'MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY','PHYSICS']]

#DataFrame Shape
m=data_c.shape[1]
n=data_c.shape[0]

data=[]
sublist=[]
for j in range(0,n):
    for i in range(1,m):
        sublist.append(data_c.iloc[j,i])
    data.append(sublist)
    sublist=[]


def _get_num_students(data):
    return len(data)


def _get_num_subjects(data):
    return len(data[0])


def _get_weighted_data(data, weight):
    return [
        [a*b for a, b in zip(row, weight)]
        for row in data
    ]



data = _get_weighted_data(data, weight)
num_students = _get_num_students(data)
num_subjects = _get_num_subjects(data)

# Create a LP Minimization problem
Lp_prob = p.LpProblem('Problem', p.LpMaximize)

# Create problem Variables
variables_matrix = [[0 for i in range(num_subjects)] for j in range(num_students)]
for i in range(0, num_students):
    for j in range(0, num_subjects):
        variables_matrix[i][j] = p.LpVariable(f"X({i+1},{j+1})", 0, 1, cat='Integer')


df_d=pd.DataFrame(data=data)
df_v=pd.DataFrame(data=variables_matrix)

ml=df_d.mul(df_v)

ml['coeff'] = ml.sum(axis=1)

coefficients=ml['coeff'].tolist()


# DEALING WITH TARGET FUNCTION VALUE

suming=0
k=0
sumsum=[]
for z in range(len(coefficients)):
    suming +=coefficients[z] 
    if z % 2000==0:
        sumsum.append(suming) 
        suming=0
if z<2000:
    sumsum.append(suming) 

sumsuming=0
for s in range(len(sumsum)):
    sumsuming=sumsuming+sumsum[s]        

Lp_prob += sumsuming   



# DEALING WITH the 2 CONSTRAINS

# 1-subject constraints

con1_suming=0
for e in range(num_subjects):
    L=df_v.iloc[:,e].to_list() 
    for t in range(len(L)):
        con1_suming +=L[t]
    Lp_prob += con1_suming <= demand[e] 
    con1_suming=0 


 # 2- students constraints

con2_suming=0
for e in range(num_students):
    L=df_v.iloc[e,:].to_list() 
    for t in range(len(L)):
        con2_suming +=L[t]
    Lp_prob += con2_suming <= 1 
    con2_suming=0        
print("time taken for TARGET+CONSTRAINS %8.8f seconds" % (time.time()-t0) )


t1=time.time()

status = Lp_prob.solve()  # Solver
print("time taken for SOLVER %8.8f seconds" % (time.time()-t1) )  # 632 SECONDS
print(p.LpStatus[status])  # The solution status
print(p.value(Lp_prob.objective))



df_v=pd.DataFrame(data=variables_matrix)


# Printing the final solution
lst=[]
val=[]

for i in range(0, num_students):
    lst.append([p.value(variables_matrix[i][j]) for j in range(0, num_subjects)])
    val.append([sum([p.value(variables_matrix[i][j]) for j in range(0, num_subjects)]),i])


ones_places=[]
for i in range (0, len(val)):
    if val[i][0]==1:
        ones_places.append(i+1)       
len(ones_places)  


data_once=data_c[data_c['STUDENT_ID'].isin(ones_places)]

IDs=[]
for i in range(len(ones_places)):
    IDs.append(data_once['STUDENT_ID'].to_list()[i])

course=[]
sub_course=[]
for i in range(len(lst)):
    j=0
    sub_course='x'
    while j<len(lst[i]):
        if lst[i][j]==1:
           sub_course=j 
        j=j+1
    course.append(sub_course)

coures_ones=[]
for i in range(len(course)):
    if course[i]!= 'x':
        coures_ones.append(course[i])     

# adding the COURSE name to the final table
# NUMBER OF DICTIONARY KEYS based on number of COURSES      
col=original_data.columns.values[1:].tolist()   
dic = {0:col[0], 1:col[1], 2:col[2], 3:col[3], 4:col[4]}
cc_name=[dic.get(n, n) for n in coures_ones]
     
one_c=[]
if len(IDs)==len(cc_name):
    for i in range(len(IDs)):
        one_c.append([IDs[i],cc_name[i]])

prob=[] 
if len(IDs)==len(cc_name):
    for i in range(len(IDs)):
        prob.append([IDs[i],cc_name[i], data_once.iloc[i][one_c[i][1]]])

scoring_table = pd.DataFrame(prob,columns=['STUDENT_ID','COURES','SCORE'])
scoring_table.sort_values(by=['COURES', 'SCORE'], ascending=[False, False], inplace=True)
scoring_table.index = np.arange(1, len(scoring_table)+1)


print(scoring_table)

711

asked Oct 14 '20 15:10

Galgo

1 Answers

Here are some more ideas on my idea of using min cost flows.

We model this problem by taking a directed graph with 4 layers, where each layer is fully connected to the next.

Nodes

First layer: A single node s that will be our source.
Second layer: One node for each student.
Third layer: One node for each subject.
Fourth layer: OA single node t that will be our drain.

Edge Capacities

First -> Second: All edges have capacity 1.
Second -> Third: All edges have capacity 1.
Third -> Fourth: All edges have the capacity corresponding to the number students that has to be assigned to that subject.

Edge Costs

First -> Second: All edges have cost 0.
Second -> Third: Remember that edges in this layer connect a student with a subject. The costs on these will be chosen anti proportional to the weighted score the student has on that subject. cost = -subject_weight*student_subject_score.
Third -> Fourth: All edges have cost 0.

Then we demand a flow from s to t equal to the number of students we have to choose.

Why does this work?

A solution to the min cost flow problem will correspond to a solution of your problem by taking all the edges between the third and fourth layer as assignments.

Each student can be chosen for at most one subject, as the corresponding node has only one incoming edge.

Each subject has exactly the number of required students, as the outgoing capacity corresponds to the number of students we have to choose for this subject and we have to use the full capacity of these edges, as we can not fulfil the flow demand otherwise.

A minimal solution to the MCF problem corresponds to the maximal solution of your problem, as the costs corresponds to the value they give.

As you asked for a solution in python I implemented the min cost flow problem with ortools. Finding a solution took less than a second in my colab notebook. What takes "long" is the extraction of the solution. But including setup and solution extraction I am still having a runtime of less than 20s for the full 100000 student problem.

Code

Click to copy

# imports
from ortools.graph import pywrapgraph
import numpy as np
import pandas as pd
import time
t_start = time.time()

# setting given problem parameters
num_students = 100000
subjects = ['MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY','PHYSICS']
num_subjects = len(subjects)
demand = [4000, 3000, 2000, 750, 250]
weight = [1.9,1.7, 1.5, 1.3, 1.1]

# generating student scores
student_scores_raw = np.random.randint(101, size=(num_students, num_subjects))

# setting up graph nodes
source_nodes = [0]
student_nodes = list(range(1, num_students+1))
subject_nodes = list(range(num_students+1, num_subjects+num_students+1))
drain_nodes = [num_students+num_subjects+1]

# setting up the min cost flow edges
start_nodes = [int(c) for c in (source_nodes*num_students + [i for i in student_nodes for _ in subject_nodes] + subject_nodes)]
end_nodes   = [int(c) for c in (student_nodes + subject_nodes*num_students + drain_nodes*num_subjects)]
capacities  = [int(c) for c in ([1]*num_students + [1]*num_students*num_subjects + demand)]
unit_costs  = [int(c) for c in ([0.]*num_students + list((-student_scores_raw*np.array(weight)*10).flatten()) + [0.]*num_subjects)]
assert len(start_nodes) == len(end_nodes) == len(capacities) == len(unit_costs)

# setting up the min cost flow demands
supplies = [sum(demand)] + [0]*(num_students + num_subjects) + [-sum(demand)]

# initialize the min cost flow problem instance
min_cost_flow = pywrapgraph.SimpleMinCostFlow()
for i in range(0, len(start_nodes)):
  min_cost_flow.AddArcWithCapacityAndUnitCost(start_nodes[i], end_nodes[i], capacities[i], unit_costs[i])
for i in range(0, len(supplies)):
  min_cost_flow.SetNodeSupply(i, supplies[i])

# solve the problem
t_solver_start = time.time()
if min_cost_flow.Solve() == min_cost_flow.OPTIMAL:
  print('Best Value:', -min_cost_flow.OptimalCost()/10)
  print('Solver time:', str(time.time()-t_solver_start)+'s')
  print('Total Runtime until solution:', str(time.time()-t_start)+'s')
  
  #extracting the solution
  solution = []
  for i in range(min_cost_flow.NumArcs()):
    if min_cost_flow.Flow(i) > 0 and min_cost_flow.Tail(i) in student_nodes:
      student_id = min_cost_flow.Tail(i)-1
      subject_id = min_cost_flow.Head(i)-1-num_students
      solution.append([student_id, subjects[subject_id], student_scores_raw[student_id, subject_id]])
  assert(len(solution) == sum(demand))

  solution = pd.DataFrame(solution, columns = ['student_id', 'subject', 'score'])
  print(solution.head())
else:
  print('There was an issue with the min cost flow input.')

print('Total Runtime:', str(time.time()-t_start)+'s')

Replacing the for-loop for the solution extraction in the above code by the following list-comprehension (that is not also not using list lookups every iteration) the runtime can be improved significantly. But for readability reasons I will leave this old solution here as well. Here is the new one:

Click to copy

  solution = [[min_cost_flow.Tail(i)-1, 
               subjects[min_cost_flow.Head(i)-1-num_students], 
               student_scores_raw[min_cost_flow.Tail(i)-1, min_cost_flow.Head(i)-1-num_students]
               ]
              for i in range(min_cost_flow.NumArcs())
              if (min_cost_flow.Flow(i) > 0 and 
                  min_cost_flow.Tail(i) <= num_students and 
                  min_cost_flow.Tail(i)>0)
              ]

The following output is giving the runtimes for the new faster implementation.

Output

Click to copy

Best Value: 1675250.7
Solver time: 0.542395830154419s
Total Runtime until solution: 1.4248979091644287s
   student_id    subject  score
0           3    ENGLISH     99
1           5       MATH     98
2          17  COMPUTERS    100
3          22  COMPUTERS    100
4          33    ENGLISH    100
Total Runtime: 1.752336025238037s

Pleas point out any mistakes I might have made.

I hope this helps. ;)

answered Sep 21 '22 12:09

SimonT

Related questions
                            
                                Even though tuples are immutable, they are stored in different addresses in interactive mode. Why?
                            
                                Delete an element from torch.Tensor
                            
                                Why does django's `apps.get_model()` return a `__fake__.MyModel` object
                            
                                ValueError: illegal value in 4-th argument of internal None when running sklearn LinearRegression().fit()
                            
                                how to download all the python packages mentioned in the requirement.txt to a folder in linux?
                            
                                Create CSV from XML/Json using Python Pandas
                            
                                Length (count) of sequences with start and end condition Python
                            
                                Unexpected number of bins in Pandas DataFrame resample
                            
                                Convert epoch, which is midnight 01/01/0001, to DateTime in pandas
                            
                                matplotlib text: Use data coords for x, axis coords for y
                            
                                Implementing a recursive algorithm in pyspark to find pairings within a dataframe
                            
                                Trying to understand __init__.py combined with getattr
                            
                                Implementing inplace operations for methods in a class
                            
                                How can I list the extra features of a Python package
                            
                                multiprocessing in python - what gets inherited by forkserver process from parent process?
                            
                                Get hour of year from a Datetime
                            
                                How to terminate loop.run_in_executor with ProcessPoolExecutor gracefully?
                            
                                How to group a dataframe by 4 time periods and key
                            
                                Filtering non-'cohorts' from dataset
                            
                                How to solve "RuntimeError: CUDA error: invalid device ordinal"?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Linear Programing- Max value optimization

Tags:

python

optimization

pandas

maximize

linear-programming

Galgo

People also ask

1 Answers

SimonT

Recent Activity

Donate For Us