I have a problem similar to subset sum, but with a few modifications:
Subset Sum: Find all ways the following list can sum to 16:
[2,9,40,2,404,12....]
My Problem:
Find all ways you can select n items (integer, discrete), each of which has 3 properties - length, width, and height - such that the combined total of each property is as close to a specific target as possible.
Example:
items = [(92,10,15),(8,18,34),(29,50,110) ...]
l,w,h = [150,200,180]
n = 3
We select 3 items such that the sums of the lengths, widths, and heights across all 3 items are as close as possible to 150, 200, and 180. This error can be formulated as a simple distance metric:
error = sum(abs(x-y) for x,y in zip(targets,actuals))
This is a contrived example but the general problem I'm solving is the same, just with more variables and constraints.
One obvious solution is just an exhaustive graph search of some sort, but I expect there are superior approaches, perhaps in the constraint programming domain.
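For scale, here is a minimal brute-force sketch of that exhaustive baseline (illustrative only - not the code I actually run), which makes clear why it cannot scale: it enumerates all C(len(items), n) combinations.

from itertools import combinations

def brute_force(items, targets, n):
    # Enumerate every n-item combination and keep the lowest-error one.
    # C(len(items), n) combinations makes this feasible only for tiny inputs.
    best, best_err = None, float('inf')
    for combo in combinations(items, n):
        actuals = [sum(prop) for prop in zip(*combo)]  # per-property totals
        err = sum(abs(t - a) for t, a in zip(targets, actuals))
        if err < best_err:
            best, best_err = combo, err
    return best, best_err

print(brute_force([(92, 10, 15), (8, 18, 34), (29, 50, 110), (60, 80, 40)], [150, 200, 180], 3))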
Note: The solution should execute in a few seconds. Finding an approximate solution quickly with low error is more important than finding the global optimum. It doesn't matter whether we undershoot or overshoot the targets, which is why the error uses absolute distance.
What is a "low error"? I'm able to get within ~5% of the targets when selecting from 1M items, in < 3 seconds. I would prefer to get that error down if possible, but 1%-5% is acceptable.
Please take into consideration my prior attempts outlined below before recommending an approach.
After a day of experimentation and research, the best approach I've found was also the simplest one: just repeatedly trying random configurations until I got an acceptable solution (i.e. "Random Restart Hill Climbing"). Simulated annealing works well too, but I haven't noticed a big enough improvement to justify the extra complexity compared to RRHC - though I admittedly did not spend any real effort tuning the parameters (temperature, schedule, etc.).
I should say that I'm fairly happy with this approach so far: it returns a "good enough" answer pretty quickly, it is tunable between a more exhaustive search and a faster one (more vs. fewer trials), and it is flexible in what I accept as "good enough", in the sense that I can just modify the "cost" function - in this case a simple distance between the target l,w,h and the current l,w,h. If, for example, I wanted to penalize being off on height more severely than length or width, I could simply multiply (h_target - h_actual) by 2 and give more weight to that part of the cost.
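To make the RRHC idea concrete, here is a minimal sketch of the approach (an illustration of the scheme described above, not my exact production code): restart from a random n-subset, greedily accept single-item swaps that lower the cost, and keep the best subset found within a time budget.

import random
import time

def cost(subset, targets, weights=(1, 1, 1)):
    # Weighted absolute miss across the three properties; bump a weight to
    # penalize that property more severely (e.g. weights=(1, 1, 2) for height).
    actuals = [sum(prop) for prop in zip(*subset)]
    return sum(w * abs(t - a) for w, t, a in zip(weights, targets, actuals))

def rrhc(items, targets, n, time_budget=3.0, swaps_per_climb=50):
    best, best_cost = None, float('inf')
    deadline = time.monotonic() + time_budget
    while time.monotonic() < deadline:
        current = random.sample(items, n)  # random restart
        improved = True
        while improved and time.monotonic() < deadline:
            improved = False
            for _ in range(swaps_per_climb):
                # Propose replacing one selected item with a random item.
                # (For simplicity this may introduce a duplicate; a real
                # version would swap in only unselected items.)
                candidate = current.copy()
                candidate[random.randrange(n)] = random.choice(items)
                if cost(candidate, targets) < cost(current, targets):
                    current, improved = candidate, True
        if cost(current, targets) < best_cost:
            best, best_cost = current, cost(current, targets)
    return best, best_cost

# Example: items = [(l, w, h), ...]; find 3 items near (150, 200, 180)
# best, err = rrhc(items, [150, 200, 180], 3, time_budget=1.0)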
That said, for educational purposes I want to go through the other approaches I tried, and see if maybe someone can point out where I went wrong in my backtracking approach specifically. Or alternatively, if I didn't miss anything and maybe RRHC is just a better algorithm for this type of problem, that would be nice to know as well.
This is not the first time I've found RRHC to work well for problems where you need a "good enough" solution and where finding the global optimum is "icing on the cake". In my experience it's a hammer for a lot of nails - surprisingly effective given its simplicity. One extension I might try later is running multiple search threads in parallel.
Backtracking: I run a backtracking search: pick an item, check that all constraints are still good, recurse to pick the next item, and backtrack whenever a constraint is violated. "All constraints good" simply means I haven't exceeded the targets. Additionally, at each selection step, I filter the list of furniture so that any items which would exceed the remaining l,w,h are removed.
For example, after the 2nd selection we have assigned 30+100, 20+50, 40+120 = 130, 70, 160, leaving a remaining l,w,h to allocate of 140-130, 80-70, 300-160 = 10, 10, 140. Any item which has l>10, w>10, or h>140 would be filtered out before we continue the search.
The problem is that while this filtering will exclude any items that would conflict at the immediate next step, it does not consider that a certain state will eventually become a dead end at a later step. If the only pieces of furniture left all have a length of 10, they pass this filtering step, yet we know the current arrangement cannot possibly lead to a solution: by the fourth step there is no budget left for length, since we will have exhausted it on step 3 when we assign 10, leaving zero remaining length.
I can do a form of "forward checking" at every step. Continuing the example: after the 2nd selection we have 140 of h remaining, but if the max h among the remaining items is 60, we know we can never arrive at a solution, since 2*60 = 120 < 140. In general, at every step we filter the furniture list such that, for each property, (selections remaining) * (max value of that property in the filtered list) >= the remaining budget for that property.
I think this would work, and terminate parts of the search space which will not be feasible much higher up in the search tree, but it feels hacky.
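A sketch of that filter (my illustration of the scheme just described, assuming `remaining` holds the unallocated budget per property and `steps_left` counts the selections still to be made):

def forward_check_filter(pool, remaining, steps_left):
    # Step 1: drop items that would overshoot any remaining budget right now.
    pool = [item for item in pool
            if all(v <= r for v, r in zip(item, remaining))]
    if not pool:
        return pool
    # Step 2 (the "forward check"): after hypothetically taking `item`, the
    # most we can add per property in the remaining steps is (steps_left - 1)
    # times the per-property maximum; if even that cannot reach the leftover
    # budget, `item` leads to a dead end and is pruned.
    # (Approximation: maxima are taken over the whole filtered pool rather
    # than pool-minus-item.)
    maxima = [max(vals) for vals in zip(*pool)]
    return [item for item in pool
            if all((steps_left - 1) * m >= r - v
                   for m, r, v in zip(maxima, remaining, item))]

# Continuing the example: 140 of h left, 2 selections to go, max h of 60
# per item -> every item gets pruned, correctly flagging the dead end.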
Mixed Integer Linear Programming: Since I have to select n items, there is a clear integer constraint, meaning pure linear solvers and, more generally, convex optimization are out the window. I thought MILP might be a possible approach, but after some research I'm not so sure. For one, I cannot think of a way to formulate this as a MILP: the only knob we have to turn is which pieces we select, and whatever l,w,h properties a piece has are a given. For that reason it seemed more of a search problem than a MILP, but that could be due to my inexperience with formulating problems in the proper form.
Constraint Satisfaction Algorithms: Backtracking search while checking constraints is actually in this family of algorithms. I think my proposed "look ahead" is vaguely reminiscent of backtracking search with the "forward checking" heuristic. There are other approaches involving arc consistency, picking the least/most constrained variable first, etc., but similar to MILP it wasn't clear to me whether this problem could be formulated in a way that leverages those options.
This is a slight variation of the knapsack problem, which is very common in Linear Programming: picking things based on some value, with some limit on another characteristic.
In this case, we want to use the total "miss distance" across the three dimensions as the valuation (lower = better). Because we want to count both low misses and high misses, we introduce a variable to capture the miss in each dimension, and constrain it to be at least as large as a possible "low miss" (target > sum) and also at least as large as a "high miss" (sum > target). Then we just take the sum of the misses in all 3 dimensions.
Many variants are possible. You could switch to "Price is Right" scoring and not allow the dimensional totals to be greater than their limits by adding those constraints, etc.
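For instance (a sketch reusing the variable names from the full program below), forbidding any overshoot is just one extra constraint per dimension, after which the miss variables only ever capture undershoot:

# "Price is Right" variant: no dimensional total may exceed its target
for dim in D:
    prob += pulp.lpSum(boxes[box][dim] * pick[box] for box in B) <= targets[dim]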
Anyhow, this is a fairly straightforward MILP. It solves for 1000 boxes almost instantly using the commented-out "GO BIG" settings. (Do NOT try that with enumeration: C(2000, 1000) is a very large number.) The problem becomes more complicated if you don't restrict the number of boxes selected, but that's another story.
Note: As stated in the comment, I need to point at a separate instance of the cbc solver on my machine. If you install pulp to replicate this, it will install a default copy of cbc, and you should just need the one-liner shown beneath.
# box selection
from random import randint

import pulp

num_boxes = 10
num_selections = 5
targets = {'l': 200, 'w': 600, 'h': 300}

# GO BIG
# num_boxes = 2000
# num_selections = 1000
# targets = {'l': 40_000, 'w': 120_000, 'h': 60_000}

# Make a list of boxes to pick from
boxes = [{'l': randint(20, 80), 'w': randint(100, 200), 'h': randint(50, 65)} for _ in range(num_boxes)]

# SET UP THE LP
prob = pulp.LpProblem('box_picker', sense=pulp.LpMinimize)

# SETS
D = ['l', 'w', 'h']
B = list(range(num_boxes))

# VARS
pick = pulp.LpVariable.dicts('pick', indices=B, cat=pulp.LpBinary)  # 1 if picked, else 0
miss = pulp.LpVariable.dicts('miss', indices=D)  # the amount by which we missed the target

# OBJ: minimize the total miss distance; all dimensions are equally weighted
prob += pulp.lpSum(miss)

# CONSTRAINTS:
# Connect the "miss" variable to the selections * dimensions.
# Because "miss" represents the absolute value of the difference between
# target and total, we must constrain it for positive and negative misses.
for dim in D:
    prob += miss[dim] >= pulp.lpSum(boxes[box][dim] * pick[box] for box in B) - targets[dim]  # pos delta
    prob += miss[dim] >= -(pulp.lpSum(boxes[box][dim] * pick[box] for box in B) - targets[dim])  # neg delta

# force the correct number of selections
prob += pulp.lpSum(pick) == num_selections

# check it
# print(prob)

# SOLVE IT
cbc_path = '/opt/homebrew/opt/cbc/bin/cbc'  # this is likely not necessary for a normal install
solver = pulp.COIN_CMD(path=cbc_path)
res = prob.solve(solver)
# prob.solve()  # <-- this is normally OK instead of the 3 lines above

# make sure we got an optimal result
assert prob.status == pulp.LpStatusOptimal

tot_selections = sum(pick[b].varValue for b in B)
print(f'Selected {tot_selections} Boxes:')
for b in B:
    if pick[b].varValue > 0.1:  # screen out the zeros (and small numerical errors)
        print(boxes[b])

print('\nDimensional "Misses":')
for d in D:
    print(d, miss[d].varValue)
Result - Optimal solution found
Objective value: 152.00000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.00
Time (Wallclock seconds): 0.00
Option for printingOptions changed from normal to all
Total time (CPU seconds): 0.00 (Wallclock seconds): 0.00
Selected 5.0 Boxes:
{'l': 39, 'w': 136, 'h': 60}
{'l': 39, 'w': 140, 'h': 53}
{'l': 64, 'w': 125, 'h': 57}
{'l': 33, 'w': 161, 'h': 57}
{'l': 51, 'w': 156, 'h': 65}
Dimensional "Misses":
l 26.0
w 118.0
h 8.0
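One more note relevant to the asker's few-second budget: CBC also works as an anytime/approximate solver. With a recent pulp you can cap the wall time and/or accept a relative optimality gap (argument names may differ slightly across pulp versions), e.g.:

# Stop after 3 seconds or once within 1% of proven optimality; the solver
# returns its best incumbent, so relax the LpStatusOptimal assert accordingly.
solver = pulp.PULP_CBC_CMD(timeLimit=3, gapRel=0.01)
prob.solve(solver)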