If I have a variable number of sets (let's call the number n), which have at most m elements each, what's the most efficient way to calculate the pairwise intersections for all pairs of sets? Note that this is different from the intersection of all n sets.
For example, if I have the following sets:
A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}
I want to be able to find:
intersect_AB={"c"}
intersect_BC={"c", "e"}
intersect_AC={"a", "c"}
Another acceptable format (if it makes things easier) would be a map of items in a given set to the sets that contain that same item. For example:
intersections_C={"a": {"A", "C"},
"c": {"A", "B", "C"}
"e": {"B", "C"}}
I know that one way to do so would be to create a dictionary mapping each value in the union of all n sets to a list of the sets in which it occurs and then iterate through all of those values to create lists such as intersections_C
above, but I'm not sure how that scales as n increases and the sizes of the set become too large.
Some additional background information:
Python count intersection of sets To count the intersection of sets in Python, we will use “len(set(set1) & set(set2))”. Here, ” & “ is an intersection element common to both. It will return the count as “3” because “10, 8, and 6” are common to both the sets.
Pairwise intersection refers to selecting one feature from the first input and intersecting it with the features in the second input that it overlaps.
Intersection() function Python The intersection of two given sets is the largest set, which contains all the elements that are common to both sets. The intersection of two given sets A and B is a set which consists of all the elements which are common to both A and B.
this ought to do what you want
import random as RND
import string
import itertools as IT
mock some data
fnx = lambda: set(RND.sample(string.ascii_uppercase, 7))
S = [fnx() for c in range(5)]
generate an index list of the sets in S so the sets can be referenced more concisely below
idx = range(len(S))
get all possible unique pairs of the items in S; however, since set intersection is commutative, we want the combinations rather than permutations
pairs = IT.combinations(idx, 2)
write a function perform the set intersection
nt = lambda a, b: S[a].intersection(S[b])
fold this function over the pairs & key the result from each function call to its arguments
res = dict([ (t, nt(*t)) for t in pairs ])
the result below, formatted per the first option recited in the OP, is a dictionary in which the values are the set intersections of two sequences; each values keyed to a tuple comprised of the two indices of those sequences
this solution, is really just two lines of code: (i) calculate the permutations; (ii) then apply some function over each permutation, storing the returned value in a structured container (key-value) container
the memory footprint of this solution is minimal, but you can do even better by returning a generator expression in the last step, ie
res = ( (t, nt(*t)) for t in pairs )
notice that with this approach, neither the sequence of pairs nor the corresponding intersections have been written out in memory--ie, both pairs and res are iterators.
If we can assume that the input sets are ordered, a pseudo-mergesort approach seems promising. Treating each set as a sorted stream, advance the streams in parallel, always only advancing those where the value is the lowest among all current iterators. Compare each current value with the new minimum every time an iterator is advanced, and dump the matches into your same-item collections.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With