I'm so far out of my league on this one, so I'm hoping someone can point me in the right direction. I think this is an optimization problem, but I've been confused by scipy.optimize and how it fits with pulp. Also, matrix math boggles my mind, so this problem has really been slowing me down and I have no one to ask.
Problem Statement: I have a dataset of customers. For each customer, I can choose one of 3 options, or choose none, so 4 options in total. For each customer, I also have a numeric score that says how "good" each choice is. You can think of this value as the probability that the choice creates a future sale.
# fake data for the internet
import pandas as pd

data = {'customerid': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
        'prob_CHOICEA': [0.00317, 0.00629, 0.00242, 0.00253, 0.00421, 0.00414, 0.00739, 0.00549, 0.00658, 0.00852],
        'prob_CHOICEB': [0.061, 0.087, 0.055, 0.027, 0.022, 0.094, 0.099, 0.072, 0.018, 0.052],
        'prob_CHOICEC': [0.024, 0.013, 0.091, 0.047, 0.071, 0.077, 0.067, 0.046, 0.077, 0.044]
        }

# Creates pandas DataFrame
df = pd.DataFrame(data)
df = df.reset_index(drop=True).set_index(['customerid'])
+------------+--------------+--------------+--------------+
| customerid | prob_CHOICEA | prob_CHOICEB | prob_CHOICEC |
+------------+--------------+--------------+--------------+
|        101 |      0.00317 |        0.061 |        0.024 |
|        102 |      0.00629 |        0.087 |        0.013 |
|        103 |      0.00242 |        0.055 |        0.091 |
|        104 |      0.00253 |        0.027 |        0.047 |
|        105 |      0.00421 |        0.022 |        0.071 |
|        106 |      0.00414 |        0.094 |        0.077 |
|        107 |      0.00739 |        0.099 |        0.067 |
|        108 |      0.00549 |        0.072 |        0.046 |
|        109 |      0.00658 |        0.018 |        0.077 |
|        110 |      0.00852 |        0.052 |        0.044 |
+------------+--------------+--------------+--------------+
I started by combining these elements into a single array for each customer:
# combine all values into 1 array
list_to_combine = ['prob_CHOICEA', 'prob_CHOICEB','prob_CHOICEC']
df['probs_A_B_C']= df[list_to_combine].values.tolist()
df.drop(list_to_combine, axis=1, inplace=True)
+------------+-------------------------+
| customerid | probs_A_B_C             |
+------------+-------------------------+
|        101 | [0.00317, 0.061, 0.024] |
|        102 | [0.00629, 0.087, 0.013] |
|        103 | [0.00242, 0.055, 0.091] |
|        104 | [0.00253, 0.027, 0.047] |
|        105 | [0.00421, 0.022, 0.071] |
|        106 | [0.00414, 0.094, 0.077] |
|        107 | [0.00739, 0.099, 0.067] |
|        108 | [0.00549, 0.072, 0.046] |
|        109 | [0.00658, 0.018, 0.077] |
|        110 | [0.00852, 0.052, 0.044] |
+------------+-------------------------+
For each customer, I only have 4 choices of what to do:
choices = [
[0,0,0],
[1,0,0],
[0,1,0],
[0,0,1]
]
For each customer, I want to pick the best choice. At first glance this seemed easy: just pick the highest number. However, it starts to blow my mind once I add constraints.
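For the unconstrained case, "pick the highest number" is a one-liner in pandas; here is a minimal sketch against the DataFrame built above (before the columns are combined into a single array):

```python
import pandas as pd

data = {'customerid': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
        'prob_CHOICEA': [0.00317, 0.00629, 0.00242, 0.00253, 0.00421, 0.00414, 0.00739, 0.00549, 0.00658, 0.00852],
        'prob_CHOICEB': [0.061, 0.087, 0.055, 0.027, 0.022, 0.094, 0.099, 0.072, 0.018, 0.052],
        'prob_CHOICEC': [0.024, 0.013, 0.091, 0.047, 0.071, 0.077, 0.067, 0.046, 0.077, 0.044]}
df = pd.DataFrame(data).set_index('customerid')

# Column name of the highest probability per customer, ignoring all constraints.
best = df.idxmax(axis=1)
```

This only answers the no-constraint version of the problem; it cannot express "at most 5 in total" or similar restrictions, which is where LP comes in.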
For example, what if I want to choose the best choice for each customer, subject to the constraint that the total number of selected choices is 5?
+------------+-------------------------+-------------+
| customerid | probs_A_B_C             | best_choice |
+------------+-------------------------+-------------+
|        101 | [0.00317, 0.061, 0.024] | [0,0,0]     |
|        102 | [0.00629, 0.087, 0.013] | [0,1,0]     |
|        103 | [0.00242, 0.055, 0.091] | [0,0,1]     |
|        104 | [0.00253, 0.027, 0.047] | [0,0,0]     |
|        105 | [0.00421, 0.022, 0.071] | [0,0,0]     |
|        106 | [0.00414, 0.094, 0.077] | [0,1,0]     |
|        107 | [0.00739, 0.099, 0.067] | [0,1,0]     |
|        108 | [0.00549, 0.072, 0.046] | [0,0,0]     |
|        109 | [0.00658, 0.018, 0.077] | [0,0,1]     |
|        110 | [0.00852, 0.052, 0.044] | [0,0,0]     |
+------------+-------------------------+-------------+
I did not even figure out how to do this; I just eyeballed it manually for illustrative purposes.
Ideally, I'd like to add multiple constraints simultaneously.
Any ideas of where to start?
You can use scipy.optimize.linprog to solve this linear optimization problem. It requires you to set up the boundary conditions as matrix products, as outlined in the docs. There are two types of boundary conditions: inequalities of the form A_ub @ x <= b_ub and equalities of the form A_eq @ x == b_eq. The problem can be modeled as follows:

- x has length N*C, where N is the number of customers and C is the number of options; it represents the choices per customer in a linear layout: [c1_A, c1_B, c1_C, c2_A, c2_B, c2_C, ..., cN_A, cN_B, cN_C].
- Max. one choice per customer: each row of A contains a 1 if the choice corresponds to that customer and a zero otherwise, and the corresponding entry of b is 1 (illustration below).
- Min. M choices for option A: since linprog only supports the form A @ x <= b, we can invert the values and use -1 entries in A at the positions that correspond to option A, and -M in b.
- Max. 10 choices for option B: use positive 1 entries in A and a positive 10 in b (it is already of the form <=).
- Total number of choices equals N: this can be modeled by an equality constraint whose single row sums over all choices in x, and the result must be equal to N.

This is an illustration of the above constraints:
# Max. one choice per customer.
# A = # b =
[[1, 1, 1, 0, 0, 0, ..., 0, 0, 0], [1,
[0, 0, 0, 1, 1, 1, ..., 0, 0, 0], 1,
... ...
[0, 0, 0, 0, 0, 0, ..., 1, 1, 1]] 1]
# Min. M choices for option A.
# A = # b =
[[-1, 0, 0, -1, 0, 0, ..., -1, 0, 0]] [[-M]]
# Max. 10 choices for option B.
# A = # b =
[[0, 1, 0, 0, 1, 0, ..., 0, 1, 0]] [[10]]
# Total number of choices equals N.
# A = # b =
[[1, 1, 1, 1, 1, 1, ..., 1, 1, 1]] [[N]]
Here's some sample code to setup the constraints and run the optimization:
import numpy as np
import pandas as pd
from scipy.optimize import linprog
data = {'customerid': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
        'prob_CHOICEA': [0.00317, 0.00629, 0.00242, 0.00253, 0.00421, 0.00414, 0.00739, 0.00549, 0.00658, 0.00852],
        'prob_CHOICEB': [0.061, 0.087, 0.055, 0.027, 0.022, 0.094, 0.099, 0.072, 0.018, 0.052],
        'prob_CHOICEC': [0.024, 0.013, 0.091, 0.047, 0.071, 0.077, 0.067, 0.046, 0.077, 0.044]
        }
# Creates pandas DataFrame
df = pd.DataFrame(data)
df = df.reset_index(drop=True).set_index(['customerid'])
print(df, end='\n\n')
nc = df.shape[1] # number of options
data = df.to_numpy().ravel()
# Max. choices per customer is 1.
A_ub_1 = np.zeros((len(df), len(data)))
for i in range(len(A_ub_1)):
    A_ub_1[i, nc*i:nc*(i+1)] = 1
b_ub_1 = np.ones(len(df))
# Min. choices for option A is 3.
A_ub_2 = np.zeros((1, len(data)))
A_ub_2[0, ::nc] = -1 # invert, since this defines an upper boundary
b_ub_2 = np.array([-3])
# Max. choices for option B is 2.
A_ub_3 = np.zeros((1, len(data)))
A_ub_3[0, 1::nc] = 1
b_ub_3 = np.array([2])
# Total sum of choices is 7.
A_eq = np.ones((1, len(data)))
b_eq = np.array([7])
result = linprog(
    -1 * data,  # linprog minimizes, so negate the values to maximize
    A_eq=A_eq, b_eq=b_eq,
    A_ub=np.concatenate((A_ub_1, A_ub_2, A_ub_3), axis=0),
    b_ub=np.concatenate((b_ub_1, b_ub_2, b_ub_3), axis=0),
    bounds=(0, 1)
)
print(result, end='\n\n')
choices = (result.x.reshape(-1, nc) > 1e-6).astype(int)
print('Choices:', choices, sep='\n')
It produces the following results:
            prob_CHOICEA  prob_CHOICEB  prob_CHOICEC
customerid
101              0.00317         0.061         0.024
102              0.00629         0.087         0.013
103              0.00242         0.055         0.091
104              0.00253         0.027         0.047
105              0.00421         0.022         0.071
106              0.00414         0.094         0.077
107              0.00739         0.099         0.067
108              0.00549         0.072         0.046
109              0.00658         0.018         0.077
110              0.00852         0.052         0.044
con: array([-1.30002675e-11])
fun: -0.3812999999903971
message: 'Optimization terminated successfully.'
nit: 7
slack: array([1.00000000e+00, 7.99305067e-11, 1.47325485e-11, 1.00000000e+00,
1.00000000e+00, 2.49527066e-11, 2.42738052e-11, 5.84235438e-10,
4.23596713e-11, 5.77714543e-11, 8.80984175e-12, 1.46305190e-11])
status: 0
success: True
x: array([2.89971936e-10, 1.32732722e-11, 6.97732845e-12, 1.00000000e+00,
3.28055311e-10, 5.72702383e-12, 1.80418885e-11, 4.61391860e-12,
1.00000000e+00, 2.01674011e-10, 4.58311340e-12, 1.29599793e-11,
2.95298295e-10, 4.34109315e-12, 1.21776975e-11, 3.39951283e-11,
1.00000000e+00, 2.55262044e-10, 4.94703751e-11, 1.00000000e+00,
1.57932544e-11, 9.99999999e-01, 2.21487598e-11, 1.33679145e-11,
2.30514296e-10, 3.91129933e-12, 1.00000000e+00, 1.00000000e+00,
8.19015577e-12, 1.07293976e-11])
Choices:
[[0 0 0]
[1 0 0]
[0 0 1]
[0 0 0]
[0 0 0]
[0 1 0]
[0 1 0]
[1 0 0]
[0 0 1]
[1 0 0]]
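One caveat: linprog solves the continuous relaxation, and the `> 1e-6` threshold above works only because the optimum happens to land on integral values. If your SciPy is 1.9 or newer, scipy.optimize.milp lets you declare the variables integer explicitly; here is a sketch of the same model under that assumption (the constraint values 3, 2, and 7 are the ones used in the code above):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

probs = np.array([
    [0.00317, 0.061, 0.024], [0.00629, 0.087, 0.013],
    [0.00242, 0.055, 0.091], [0.00253, 0.027, 0.047],
    [0.00421, 0.022, 0.071], [0.00414, 0.094, 0.077],
    [0.00739, 0.099, 0.067], [0.00549, 0.072, 0.046],
    [0.00658, 0.018, 0.077], [0.00852, 0.052, 0.044],
])
n, nc = probs.shape
c = -probs.ravel()  # milp minimizes, so negate to maximize

# One row per customer: at most one choice each.
A_cust = np.zeros((n, n * nc))
for i in range(n):
    A_cust[i, nc * i:nc * (i + 1)] = 1

# Single-row indicators for options A and B.
opt_a = np.zeros((1, n * nc))
opt_a[0, 0::nc] = 1
opt_b = np.zeros((1, n * nc))
opt_b[0, 1::nc] = 1

constraints = [
    LinearConstraint(A_cust, 0, 1),                # <= 1 choice per customer
    LinearConstraint(opt_a, 3, np.inf),            # >= 3 choices of option A
    LinearConstraint(opt_b, 0, 2),                 # <= 2 choices of option B
    LinearConstraint(np.ones((1, n * nc)), 7, 7),  # exactly 7 choices in total
]

res = milp(c=c, constraints=constraints,
           integrality=np.ones(n * nc),  # all variables integer (0/1 with the bounds)
           bounds=Bounds(0, 1))
choices = res.x.reshape(n, nc).round().astype(int)
```

With milp there is no need for a thresholding step at the end; the solver guarantees a 0/1 solution.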
This problem may be solved using Linear Programming (LP), yet the most difficult part isn't knowing that you should use LP; it is transforming your problem into an LP optimization problem, and I will show you how to do just that. Before proceeding, I will shrink the example data you provided for simplification purposes (because of the huge number of generated variables); therefore, assume we have the following input data:
+------------+---------+--------+
| customerid | prob A  | prob B |
+------------+---------+--------+
|        101 | 0.00317 | 0.061  |
|        102 | 0.00629 | 0.087  |
+------------+---------+--------+
Assume the size of the input problem is N, where N represents the number of choices:
+---------+--------+
| prob A  | prob B |
+---------+--------+
| 0.00317 | 0.061  |
| 0.00629 | 0.087  |
+---------+--------+
Since we have 4 different choices, N=4
(it doesn't matter that some of them are mutually exclusive; such characteristics will be mapped by the constraints). In LP we have the following things to deal with:

- a cost (objective) function C (dimensions 1x[at least N]; it's a row-array),
- a matrix A of constraints (its dimensions depend on how many restrictions you want to add; you may also have more restrictions than variables), and
- a right-hand side (RHS) b (dimensions [number of rows in A]x1; it's a column-array).

Accordingly, an LP maximization problem will have the following format:
Max Cx
subject to:
Ax <= b
x >= 0
Note that from this point forward, we will create LP variables to represent the input data we have; therefore, assume the following mapping between the variables xi and the input data:
+-------------+----------+------+
| LP variable | Customer | prob |
+-------------+----------+------+
| x0          | 101      | A    |
| x1          | 101      | B    |
| x2          | 102      | A    |
| x3          | 102      | B    |
+-------------+----------+------+
Let's start by filling our constraints matrix A and the RHS b in parallel. We should create a matrix formed by the concatenation of the columns of two NxN identity matrices:
A
      +----+----+----+----+----+----+----+----+        b
      | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 |      +---+
      +----+----+----+----+----+----+----+----+      +---+
row 0 |  1 |  0 |  0 |  0 |  1 |  0 |  0 |  0 |      | 1 |
row 1 |  0 |  1 |  0 |  0 |  0 |  1 |  0 |  0 |      | 1 |
row 2 |  0 |  0 |  1 |  0 |  0 |  0 |  1 |  0 |      | 1 |
row 3 |  0 |  0 |  0 |  1 |  0 |  0 |  0 |  1 |      | 1 |
      +----+----+----+----+----+----+----+----+      +---+
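The block above can be generated rather than typed out; here is a small NumPy sketch (the variable names are mine, not part of the modeling):

```python
import numpy as np

N = 4  # number of decision variables x0..x3; x4..x7 are their paired slacks

# Rows 0..3: two NxN identity matrices concatenated column-wise.
A = np.hstack([np.eye(N), np.eye(N)])
b = np.ones(N)
```

Each row then reads x_i + x_{i+N} = 1, which is exactly the structure illustrated above.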
We also need to ensure that at most one variable is selected per customer (i.e., per row of our input data), so we create one extra variable per customer, in this case x8 and x9, and set each of them to 1 on one of the 2 new rows we will create in A. In addition, each new row must also have 1s for the variables that map to that customer (simply look at which variables are present for the desired customer). So we add the following 2 rows to matrix A and column-array b:
A
      +----+----+----+----+----+----+----+----+----+----+        b
      | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 |      +---+
      +----+----+----+----+----+----+----+----+----+----+      +---+
row 4 |  1 |  1 |  0 |  0 |  0 |  0 |  0 |  0 |  1 |  0 |      | 1 |
row 5 |  0 |  0 |  1 |  1 |  0 |  0 |  0 |  0 |  0 |  1 |      | 1 |
      +----+----+----+----+----+----+----+----+----+----+      +---+
Now A becomes:
A
      +----+----+----+----+----+----+----+----+----+----+        b
      | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 |      +---+
      +----+----+----+----+----+----+----+----+----+----+      +---+
row 0 |  1 |  0 |  0 |  0 |  1 |  0 |  0 |  0 |  0 |  0 |      | 1 |
row 1 |  0 |  1 |  0 |  0 |  0 |  1 |  0 |  0 |  0 |  0 |      | 1 |
row 2 |  0 |  0 |  1 |  0 |  0 |  0 |  1 |  0 |  0 |  0 |      | 1 |
row 3 |  0 |  0 |  0 |  1 |  0 |  0 |  0 |  1 |  0 |  0 |      | 1 |
row 4 |  1 |  1 |  0 |  0 |  0 |  0 |  0 |  0 |  1 |  0 |      | 1 |
row 5 |  0 |  0 |  1 |  1 |  0 |  0 |  0 |  0 |  0 |  1 |      | 1 |
      +----+----+----+----+----+----+----+----+----+----+      +---+
Suppose we also want to add a constraint to ensure that at most 2 choices are made in total. Then we add row 6 and a new column x10 to A, setting to 1 the variables from x0 to x3 as well as x10:
A
      +----+----+----+----+----+----+----+----+----+----+-----+        b
      | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |      +---+
      +----+----+----+----+----+----+----+----+----+----+-----+      +---+
row 0 |  1 |  0 |  0 |  0 |  1 |  0 |  0 |  0 |  0 |  0 |   0 |      | 1 |
row 1 |  0 |  1 |  0 |  0 |  0 |  1 |  0 |  0 |  0 |  0 |   0 |      | 1 |
row 2 |  0 |  0 |  1 |  0 |  0 |  0 |  1 |  0 |  0 |  0 |   0 |      | 1 |
row 3 |  0 |  0 |  0 |  1 |  0 |  0 |  0 |  1 |  0 |  0 |   0 |      | 1 |
row 4 |  1 |  1 |  0 |  0 |  0 |  0 |  0 |  0 |  1 |  0 |   0 |      | 1 |
row 5 |  0 |  0 |  1 |  1 |  0 |  0 |  0 |  0 |  0 |  1 |   0 |      | 1 |
row 6 |  1 |  1 |  1 |  1 |  0 |  0 |  0 |  0 |  0 |  0 |   1 |      | 2 |
      +----+----+----+----+----+----+----+----+----+----+-----+      +---+
Note that in this simple example, restricting the number of choices to at most 2 doesn't impact the final result.
Finally we build the objective function:
C
+---------+-------+---------+-------+----+----+----+----+----+----+-----+
| x0      | x1    | x2      | x3    | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
+---------+-------+---------+-------+----+----+----+----+----+----+-----+
| 0.00317 | 0.061 | 0.00629 | 0.087 |  0 |  0 |  0 |  0 |  0 |  0 |   0 |
+---------+-------+---------+-------+----+----+----+----+----+----+-----+
The variables that were created without a mapping to the customer's input data are called slack variables; their purpose is to correctly structure the math behind the LP problem.
Now that you know how to model your problem as an LP optimization problem, you must choose a method to solve it. I recommend the simplex algorithm, which you can find in scipy.
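To make this concrete, here is a sketch that feeds the A, b, and C derived above into scipy.optimize.linprog. Since the slack variables turn every restriction into an equality, the matrix goes into A_eq; linprog minimizes, so C is negated:

```python
import numpy as np
from scipy.optimize import linprog

# Rows 0..6 of A as derived above; columns are x0..x10 (x4..x10 are slacks).
A_eq = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # row 0: x0 + x4 = 1
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # row 1: x1 + x5 = 1
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # row 2: x2 + x6 = 1
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],  # row 3: x3 + x7 = 1
    [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],  # row 4: at most one choice for 101
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0],  # row 5: at most one choice for 102
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1],  # row 6: at most 2 choices in total
], dtype=float)
b_eq = np.array([1, 1, 1, 1, 1, 1, 2], dtype=float)
C = np.array([0.00317, 0.061, 0.00629, 0.087, 0, 0, 0, 0, 0, 0, 0])

# Negate C because linprog minimizes; variables are non-negative.
res = linprog(-C, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
```

At the optimum, x1 and x3 come out as 1 (customer 101 and 102 each pick prob B), matching the interpretation given below.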
After running your preferred solver, you must interpret the output. The result should be a single-row array containing the value of each xi. The output of the example above would be:
+----+----+----+----+----+----+----+----+----+----+-----+
| x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
+----+----+----+----+----+----+----+----+----+----+-----+
|  0 |  1 |  0 |  1 |  1 |  0 |  1 |  0 |  0 |  0 |   0 |
+----+----+----+----+----+----+----+----+----+----+-----+
The above result means you should pick the elements represented by variables x1 and x3, since they are set to 1; i.e., customer 101 selects prob B, and customer 102 selects prob B as well.
Post Scriptum:
- You can use the scipy.optimize.linprog lib to get the job done; just make sure you pass the constraints via the parameter A_eq instead of A_ub if you use the above modeling, since the slack variables turn every restriction into an equality.
- There are other solvers, such as glpk, which might provide more advanced tools for any further needs in your problem.