So, I have several sets, and I need to find the smallest collection of those sets whose union contains every element. To make this more concrete: I have a set of server names, and each server has a set of possible service windows. Given a specific duration, I want to find the smallest set(s) of service windows which will cover all of the given servers.
I already have code which generates a list of all non-overlapping N-minute time segments for each desired machine. I was going to brute-force it by generating all possible combinations and picking the one with the smallest number of unique elements, but that seems incredibly inefficient, particularly with more than a few systems, even if I first reduce the candidates to the unique windows across all hosts.
Then I thought I'd sort the time slots by the number of hosts which fit into each slot, pick the slot with the largest number of hosts, regenerate the list of slots for the remaining unassigned hosts, pick the most popular of those, and so on until all hosts were accounted for. While that would get me an answer, it doesn't leave me any opportunity to select the most balanced set: a secondary goal is to find a set of service windows with minimal standard deviation in the number of hosts per window. So if I have 100 hosts, I'd like to give preference to windows that give me about 50 hosts each, instead of the three "98, 1 and 1" windows which the above algorithm might find. But if my options are "98, 1, 1" or ten windows with 10 hosts each, I'd rather do the three.
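In rough Python, the pick-the-most-popular-slot idea I'm describing would look something like this (the window names and hosts are just placeholders):

    def greedy_cover(windows: dict[str, set[str]], hosts: set[str]) -> list[str]:
        """Repeatedly pick the window that covers the most uncovered hosts."""
        uncovered = set(hosts)
        chosen: list[str] = []
        while uncovered:
            best = max(windows, key=lambda w: len(windows[w] & uncovered))
            if not windows[best] & uncovered:
                raise ValueError("some hosts fit in no window")
            chosen.append(best)
            uncovered -= windows[best]
        return chosen

    windows = {
        "mon-02:00": {"a", "b", "c"},
        "tue-03:00": {"c", "d"},
        "wed-04:00": {"d", "e"},
    }
    print(greedy_cover(windows, {"a", "b", "c", "d", "e"}))
    # ['mon-02:00', 'wed-04:00'] - greedy, so "98, 1, 1" style answers can happen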
Anyway, it seems like some kind of graph could be used to represent this problem, but I focused more on hardware than software in my CS path, and solving graph problems was never really my forte. So I'd even appreciate some suggestions on where to read more on this particular problem or appropriate search terms. :)
This is the set cover optimization problem, which is NP-hard. That means no known algorithm beats brute force in the worst case, although for some particular arrangements of sets a covering can be found quickly. But you can definitely simplify the problem, especially with real-world data.
I'd start by doing several transformations on the data. Imagine your sets as a grid of booleans. Each column represents a server, each row represents a set. True in a cell means that server exists in that set.
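For concreteness, here's one way to build that grid in Python (all the window and server names below are made up):

    windows = {
        "w1": {"A", "B", "C"},
        "w2": {"A", "B"},
        "w3": {"C", "D"},
    }
    servers = sorted(set().union(*windows.values()))

    # matrix[window] is that set's row; entry i is True iff servers[i]
    # belongs to the window.
    matrix = {w: [s in members for s in servers] for w, members in windows.items()}
    for w, row in matrix.items():
        print(w, row)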
It is likely that you have multiple servers which belong to exactly the same sets, i.e. identical columns in this matrix, and duplicate columns can be removed. If A and B have exactly the same service windows, any covering that includes A will also include B. You can find the duplicates quickly by hashing each server's column, then checking the servers within each hash bucket for identical set membership. Keep just one server from each group and record the rest as a list of "ride along" servers; from this point forward each group is treated as a single member of your sets.
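A sketch of that deduplication, keyed on each server's column (same made-up data as above):

    from collections import defaultdict

    windows = {
        "w1": {"A", "B", "C"},
        "w2": {"A", "B"},
        "w3": {"C", "D"},
    }
    servers = set().union(*windows.values())

    # Group servers by the exact set of windows they belong to.
    groups: dict[frozenset, list] = defaultdict(list)
    for s in sorted(servers):
        membership = frozenset(w for w, members in windows.items() if s in members)
        groups[membership].append(s)

    # Keep one representative per distinct column; the rest "ride along".
    ride_along = {members[0]: members[1:] for members in groups.values()}
    print(ride_along)  # {'A': ['B'], 'C': [], 'D': []}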
Remove any set which is a subset of a single other set. E.g. if you have {1,2,3} and {1,2}, remove {1,2} from the problem: when picking sets for your covering, {1,2,3} is always at least as good a choice. So dump the subset, i.e. remove all such rows.
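A sketch of the subset removal; it's quadratic in the number of sets, which should be fine after the other reductions:

    def drop_dominated(windows: dict[str, set[str]]) -> dict[str, set[str]]:
        """Drop windows whose servers are a strict subset of another window's.

        Exact-duplicate rows should be collapsed to one beforehand.
        """
        return {
            w: members
            for w, members in windows.items()
            if not any(members < other for other in windows.values())
        }

    windows = {"w1": {"A", "B", "C"}, "w2": {"A", "B"}, "w3": {"C", "D"}}
    print(drop_dominated(windows))  # w2 is gone; w1 and w3 survive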
Iterate over each server (i.e. each column). If a server exists in only one set (a single True in its column), that set must be part of the solution, so add it to the solution and remove its row from the matrix. You can then also remove the columns of every server in that set. Say server A exists only in set 1, which is {A, B, C}. Set 1 must be part of the solution, and once it is, servers A, B and C are automatically covered, so remove the row for set 1 and the columns for A, B and C.
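A sketch of that step, returning the forced sets plus the shrunken matrix:

    def take_forced(windows: dict[str, set[str]]) -> tuple[set[str], dict[str, set[str]]]:
        """Pull out windows forced by servers that appear in exactly one set."""
        forced: set[str] = set()
        for s in set().union(*windows.values()):
            homes = [w for w, members in windows.items() if s in members]
            if len(homes) == 1:
                forced.add(homes[0])
        covered = set().union(*(windows[w] for w in forced)) if forced else set()
        remaining = {
            w: members - covered
            for w, members in windows.items()
            if w not in forced and members - covered
        }
        return forced, remaining

    windows = {"w1": {"A", "B", "C"}, "w3": {"C", "D"}}
    print(take_forced(windows))  # ({'w1', 'w3'}, {}) - both windows are forced here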
Remove any servers which are in every set; whatever sets end up in the solution, those servers will be covered.
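And a sketch of this last reduction:

    windows = {"w1": {"A", "B", "C"}, "w2": {"A", "C", "D"}}

    # Servers present in every window are covered no matter what we pick.
    in_all = set.intersection(*windows.values())
    reduced = {w: members - in_all for w, members in windows.items()}
    print(in_all, reduced)  # {'A', 'C'} {'w1': {'B'}, 'w2': {'D'}}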
After this, your real problem set should be much reduced unless your data set is "perverse". These reductions can also enable one another (removing a subset row can leave some server in only one set, for instance), so it's worth repeating them until nothing changes. Real-world data will likely be orderly enough that a bunch of rows and columns drop out of the problem.
I'm sure there are good ways of searching that in practice will get you very close to the optimal covering in better-than-brute-force time on real-world data. In fact, the greedy approach you describe (repeatedly take the window covering the most remaining hosts) is the classic approximation algorithm for set cover, and it is guaranteed to produce a cover within a factor of roughly ln(n) of the optimum.
I'd consider a DP or A* search over the reduced problem. I'm out of time; I might sketch it later.
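In the meantime, one simple exact approach that becomes practical once the reductions above have shrunk the matrix is plain exhaustive search in order of cover size (just a sketch, not the DP/A* idea):

    from itertools import combinations

    def smallest_cover(windows: dict[str, set[str]], hosts: set[str]) -> list[str]:
        """Exact minimum cover: try all covers of size 1, then 2, and so on."""
        names = list(windows)
        for k in range(1, len(names) + 1):
            for combo in combinations(names, k):
                if set().union(*(windows[w] for w in combo)) >= hosts:
                    return list(combo)
        raise ValueError("no cover exists")

    windows = {"w1": {"A", "B"}, "w2": {"B", "C"}, "w3": {"A", "C"}}
    print(smallest_cover(windows, {"A", "B", "C"}))  # ['w1', 'w2']

Enumerating all covers of the minimum size, rather than returning the first one found, would also let you pick the most balanced cover for your secondary goal.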