Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get all intersections of sets in python fast

I would like to compute all (different) intersections of a collection of finite sets of integers (here implemented as a list of lists) in python (to avoid confusion, a formal definition is at the end of the question):

> A = [[0,1,2,3],[0,1,4],[1,2,4],[2,3,4],[0,3,4]]
> all_intersections(A) # desired output
[[], [0], [1], [2], [3], [4], [0, 1], [0, 3], [0, 4], [1, 2], [1, 4], [2, 3], [2, 4], [3, 4], [0, 1, 4], [0, 3, 4], [1, 2, 4], [2, 3, 4], [0, 1, 2, 3]]

I have an algorithm that does it iteratively, but it is rather slow (should I post it?), a test case would be

[[0, 1, 2, 3, 4, 9], [0, 1, 4, 5, 6, 10], [0, 2, 4, 5, 7, 11], [1, 3, 4, 6, 8, 12], [2, 3, 4, 7, 8, 13], [4, 5, 6, 7, 8, 14], [0, 1, 9, 10, 15, 16], [0, 2, 9, 11, 15, 17], [1, 3, 9, 12, 16, 18], [2, 3, 9, 13, 17, 18], [9, 15, 16, 17, 18, 19], [0, 5, 10, 11, 15, 20], [1, 6, 10, 12, 16, 21], [10, 15, 16, 19, 20, 21], [5, 6, 10, 14, 20, 21], [11, 15, 17, 19, 20, 22], [5, 7, 11, 14, 20, 22], [2, 7, 11, 13, 17, 22], [7, 8, 13, 14, 22, 23], [3, 8, 12, 13, 18, 23], [13, 17, 18, 19, 22, 23], [14, 19, 20, 21, 22, 23], [6, 8, 12, 14, 21, 23], [12, 16, 18, 19, 21, 23]]

which takes me about 2.5 secs to compute.

Any ideas how to do it fast?

Formal definition (actually hard without latex mode): let A = {A1,...,An} be a finite set of finite sets Ai of non-negative integers. The output should then be the set { intersection of the sets in B : B subset of A }.

So the formal algorithm would be to take the union of all intersections of all subsets of A. But that's clearly taking forever.

Many thanks!

like image 456
Christian Avatar asked Feb 06 '23 21:02

Christian


1 Answers

Here is a recursive solution. It is almost instantaneous on your test example:

def allIntersections(frozenSets):
    if len(frozenSets) == 0:
        return []
    else:
        head = frozenSets[0]
        tail = frozenSets[1:]
        tailIntersections = allIntersections(tail)
        newIntersections = [head]
        newIntersections.extend(tailIntersections)
        newIntersections.extend(head & s for s in tailIntersections)
        return list(set(newIntersections))

def all_intersections(lists):
    sets = allIntersections([frozenset(s) for s in lists])
    return [list(s) for s in sets]

On Edit Here is a cleaner, nonrecursive implementation of the same ideas.

The problem is easiest if you define the intersection of an empty collection of sets to be the universal set, and an adequate universal set can be obtained by taking the union of all elements. This is a standard move in lattice-theory, and is dual to taking the union of an empty collection of sets to be the empty set. You could always throw away this universal set if you don't want it:

def allIntersections(frozenSets):
    universalSet = frozenset.union(*frozenSets)
    intersections = set([universalSet])
    for s in frozenSets:
        moreIntersections = set(s & t for t in intersections)
        intersections.update(moreIntersections)
    return intersections

def all_intersections(lists):
    sets = allIntersections([frozenset(s) for s in lists])
    return [list(s) for s in sets]

The reason that this is so fast with your test example is that, even though your collection has 24 sets, hence having 2**24 (16.8 million) potential intersections, there are in fact only 242 (or 241 if you don't count the empty intersection) distinct intersections. Thus the number of intersections in each pass through the loop is in the low hundreds at most.

It is possible to pick 24 sets so that all of the 2**24 possible intersections are in fact different, so it is easy to see that the worst-case behavior is exponential. But if, as in your test example, the number of intersections is small, this approach will allow you to rapidly compute them.

A potential optimization might be to sort the sets in increasing size before you loop over them. Processing the smaller sets up front might result in more empty intersections appearing earlier, thus keeping the total number of distinct intersections smaller until towards the end of the loop.

like image 145
John Coleman Avatar answered Feb 20 '23 04:02

John Coleman