
Set Cover or Hitting Set; NumPy, least element combinations to make up full set

My goal is to find the smallest possible number of the subsets [a-f] whose union makes up the full set A.

A = set([1,2,3,4,5,6,7,8,9,10]) # full set


#--- below are subsets of A ---

a = set([1,2])
b = set([1,2,3])
c = set([1,2,3,4])
d = set([4,5,6,7])
e = set([7,8,9])
f = set([5,8,9,10])

In reality, the parent set A I'm dealing with contains 15k unique elements, there are 30k subsets, and those subsets range in length from a single unique element to 1.5k unique elements.

As of now the code I'm using looks more or less like the following and is PAINFULLY slow:

import random


B = {'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f}
Bx = list(B.keys())  # materialise the keys as a list so they can be shuffled
random.shuffle(Bx)

Dict = {}

for i in Bx: # iterate through shuffled keys.
    z = [i]
    x = B[i]
    L = len(x)

    while L < len(A):
        for ii in Bx:
            x = x | B[ii]
            Lx = len(x)
            if Lx > L: # the union grew, so keep this subset
                L = Lx
                z.append(ii)

    # group the candidate covers by their size
    try:
        Dict[len(z)].append(z)
    except KeyError:
        Dict[len(z)] = [z]

print(Dict[min(Dict.keys())])

This is just to give an idea of the approach I've taken. For clarity I've left out some logic that avoids re-iterating over sets that are already too large, and other minor optimizations.

I imagine NumPy would be REALLY good at this type of problem, but I can't think of a way to use it.
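
One direction I can sketch (purely illustrative, and untested at scale): encode subset membership as a boolean matrix and vectorize the greedy step. At the 30k x 15k scale above the dense matrix would be roughly 450 MB, so scipy.sparse would probably be needed in practice. The function name greedy_cover_numpy is just a placeholder of mine:

import numpy as np

def greedy_cover_numpy(subsets, universe):
    # hypothetical helper, not part of any library
    elems = sorted(universe)
    index = {e: i for i, e in enumerate(elems)}
    # membership matrix: M[i, j] is True if element j is in subset i
    M = np.zeros((len(subsets), len(elems)), dtype=bool)
    for i, s in enumerate(subsets):
        M[i, [index[e] for e in s]] = True
    uncovered = np.ones(len(elems), dtype=bool)
    chosen = []
    while uncovered.any():
        # how many still-uncovered elements each subset would add
        gains = (M & uncovered).sum(axis=1)
        best = int(gains.argmax())
        if gains[best] == 0:
            raise ValueError("subsets do not cover the universe")
        chosen.append(best)
        uncovered &= ~M[best]
    return chosen  # indices into subsets, e.g. [2, 5, 3] for c, f, d

# greedy_cover_numpy([a, b, c, d, e, f], A)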

asked Feb 23 '14 by Ryder Brooks



1 Answer

The question is asking for a solution to the Set Cover Problem, which is NP-hard, so no fast algorithm is known that finds the optimal solution. However, the greedy approach to the problem - repeatedly choosing the subset that contains the most elements that have not been covered yet - does a good job in reasonable time, and is guaranteed to produce a cover at most roughly ln(n) times larger than the optimal one.

You can find an implementation of that algorithm in Python at this previous question.
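
For reference, the core of that greedy algorithm is only a few lines. Here is a minimal sketch of my own (illustrative, not a copy of the code at that link):

def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    results = []
    covered = set()
    while covered < parent_set:
        # pick the subset contributing the most not-yet-covered elements
        best = max(subsets, key=lambda s: len(s - covered))
        if not (best - covered):
            raise ValueError("subsets cannot cover parent_set")
        results.append(best)
        covered |= best
    return results

# with the question's data this returns a 3-subset cover, e.g. [c, f, d]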

Edited to add: @Aaron Hall's answer can be improved by using the below drop-in replacement for his greedy_set_cover routine. In Aaron's code, we calculate a score len(s - result_set) for every remaining subset each time we want to add a subset to the cover. However, this score can only decrease as the result_set grows; so if on the current iteration we have picked out a best-so-far set whose score is higher than any score the remaining subsets achieved in previous iterations, we know their scores cannot improve and we can just ignore them. That suggests using a priority queue to store the subsets for processing; in Python we can implement the idea with heapq:

# at top of file
import heapq
import logging  # logging.debug is used below, as in the original answer
#... etc

# replace greedy_set_cover
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    max_score = len(parent_set)  # renamed so the builtin max() isn't shadowed
    # create the initial heap. Note 'subsets' can be unsorted,
    # so this is independent of whether remove_redundant_subsets is used.
    heap = []
    for s in subsets:
        # Python's heapq lets you pop the *smallest* value, so we
        # want to use max_score-len(s) as a score, not len(s).
        # len(heap) just provides a unique number for each subset,
        # used to tiebreak equal scores.
        heapq.heappush(heap, [max_score-len(s), len(heap), s])
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        best = []
        unused = []
        while heap:
            score, count, s = heapq.heappop(heap)
            if not best:
                best = [max_score-len(s - result_set), count, s]
                continue
            if score >= best[0]:
                # because subset scores only get worse as the resultset
                # gets bigger, we know that the rest of the heap cannot beat
                # the best score. So push the subset back on the heap, and
                # stop this iteration.
                heapq.heappush(heap, [score, count, s])
                break
            score = max_score-len(s - result_set)
            if score >= best[0]:
                unused.append([score, count, s])
            else:
                unused.append(best)
                best = [score, count, s]
        add_set = best[2]
        logging.debug('len of add_set is {0} score was {1}'.format(len(add_set), best[0]))
        results.append(add_set)
        result_set.update(add_set)
        # subsets that were not the best get put back on the heap for next time.
        while unused:
            heapq.heappush(heap, unused.pop())
    return results

For comparison, here are the timings for Aaron's code on my laptop. I dropped remove_redundant_subsets, since when we use the heap, dominated subsets are never reprocessed anyway:

INFO:root:make_subsets function took 15800.697 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 463.478 ms
INFO:root:greedy_set_cover function took 32662.359 ms
INFO:root:len of results is 46

And here are the timings with the code above; roughly 3.7 times faster.

INFO:root:make_subsets function took 15674.409 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 461.027 ms
INFO:root:greedy_pq_set_cover function took 8896.885 ms
INFO:root:len of results is 46

NB: these two algorithms process the subsets in different orders and will occasionally give different answers for the size of the set cover; this is just down to 'lucky' choices of subsets when scores are tied.

The priority queue/heap is a well-known optimisation of the greedy algorithm, though I couldn't find a decent discussion of it to link to.

While the greedy algorithm is a quick-ish way to get an approximate answer, you can improve on that answer by spending time afterwards, since the greedy result gives us an upper bound on the minimal set cover. Techniques to do so include simulated annealing, or branch-and-bound algorithms as illustrated in this article by Peter Norvig.
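
To make the branch-and-bound idea concrete, here is a small sketch of my own (illustrative, not taken from Norvig's article). It branches on an uncovered element, tries every subset containing it, and prunes any partial cover that cannot beat the incumbent, seeded with the greedy result as the initial upper bound:

def bb_set_cover(subsets, parent_set, greedy_cover):
    """Exact set cover via branch and bound, seeded with a known cover."""
    parent_set = set(parent_set)
    best = list(greedy_cover)  # incumbent: any known cover is an upper bound

    def recurse(chosen, covered):
        nonlocal best
        if covered >= parent_set:
            best = list(chosen)  # only reached when len(chosen) < len(best)
            return
        if len(chosen) + 1 >= len(best):
            return  # prune: even one more subset cannot beat the incumbent
        # every valid cover must include some subset containing this element
        elem = next(iter(parent_set - covered))
        for s in subsets:
            if elem in s:
                chosen.append(s)
                recurse(chosen, covered | s)
                chosen.pop()

    recurse([], set())
    return best

# e.g. seeded with the greedy result:
# exact = bb_set_cover([a, b, c, d, e, f], A,
#                      greedy_set_cover([a, b, c, d, e, f], A))

The search is exponential in the worst case, so on large instances like yours this is only practical when the greedy bound is already close to optimal.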

answered Oct 15 '22 by bazzargh