Pairwise Set Intersection in Python

Tags: python, set, set-intersection

If I have a variable number of sets (let's call the number n), which have at most m elements each, what's the most efficient way to calculate the pairwise intersections for all pairs of sets? Note that this is different from the intersection of all n sets.

For example, if I have the following sets:

A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}

I want to be able to find:

intersect_AB={"c"}
intersect_BC={"c", "e"}
intersect_AC={"a", "c"}

Another acceptable format (if it makes things easier) would be a map of items in a given set to the sets that contain that same item. For example:

intersections_C={"a": {"A", "C"},
                 "c": {"A", "B", "C"},
                 "e": {"B", "C"}}

I know that one way to do this would be to create a dictionary mapping each value in the union of all n sets to a list of the sets in which it occurs, and then iterate through those values to build maps such as intersections_C above, but I'm not sure how that scales as n increases and the sets themselves become very large.
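The dictionary-based idea described above can be sketched roughly as follows. This is a minimal illustration rather than a tuned implementation. The function name pairwise_from_index and the named_sets mapping from set names to sets are assumptions introduced for the example.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_from_index(named_sets):
    """Build an item -> set-names index, then expand it into pairwise intersections.

    named_sets is a hypothetical mapping such as {"A": {...}, "B": {...}}.
    """
    # Pass 1: record, for every item, the names of the sets that contain it.
    index = defaultdict(set)
    for name, items in named_sets.items():
        for item in items:
            index[item].add(name)

    # Pass 2: every item shared by k >= 2 sets contributes to k*(k-1)/2 pairs.
    pairwise = defaultdict(set)
    for item, names in index.items():
        # Items found in only one set yield no combinations and are skipped.
        for pair in combinations(sorted(names), 2):
            pairwise[pair].add(item)
    return dict(index), dict(pairwise)


sets = {"A": {"a", "b", "c"}, "B": {"c", "d", "e"}, "C": {"a", "c", "e"}}
index, pairwise = pairwise_from_index(sets)
print(pairwise[("A", "B")])          # {'c'}
print(sorted(pairwise[("B", "C")]))  # ['c', 'e']
print(sorted(index["c"]))            # ['A', 'B', 'C']
```

The first pass touches each element once, and the second pass costs k(k-1)/2 steps for an item that appears in k sets, so it stays cheap when the intersections are small. The index itself still holds every distinct element in memory, which is the scaling concern raised above.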

Some additional background information:

The sets are all roughly the same size and are very large (large enough that storing them all in memory is a realistic concern; an algorithm that avoids doing so would be preferred but is not required).
The intersection of any two sets is very small compared to the sets themselves.
If it helps, we can assume anything we need to about the ordering of the input sets.

asked Dec 09 '14 00:12

ankushg

2 Answers

This ought to do what you want.

import random as RND
import string
import itertools as IT

Mock some data:

fnx = lambda: set(RND.sample(string.ascii_uppercase, 7))
S = [fnx() for c in range(5)]

Generate the indices of the sets in S so the sets can be referenced more concisely below:

idx = range(len(S))

Get all unique pairs of indices into S. Since set intersection is commutative, we want combinations rather than permutations:

pairs = IT.combinations(idx, 2)

Write a function to perform the set intersection:

nt = lambda a, b: S[a].intersection(S[b])

Apply this function to each pair and key the result of each call to its arguments:

res = {t: nt(*t) for t in pairs}

The result, formatted per the first option described in the question, is a dictionary whose values are the pairwise set intersections, each keyed to a tuple of the indices of the two sets involved.

This solution is really just two steps: (i) generate the combinations; (ii) apply a function to each combination, storing the returned value in a key-value container.

Beyond the sets themselves, the memory overhead of this solution is small as long as the intersections are small, but you can do even better by using a generator expression in the last step, i.e.:

res = ( (t, nt(*t)) for t in pairs )

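Notice that with this approach, neither the sequence of pairs nor the corresponding intersections are materialized in memory, i.e., both pairs and res are iterators. Because pairs is a one-shot iterator, recreate it with IT.combinations(idx, 2) if it has already been consumed, for example by building the dictionary above.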
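For reference, here is a small sketch of the same combinations-based approach applied to the example sets from the question, keyed by set name instead of index. The names named and pairwise_intersections are assumptions made up for this illustration.

```python
from itertools import combinations

# Example sets from the question, keyed by a (hypothetical) name.
named = {"A": {"a", "b", "c"}, "B": {"c", "d", "e"}, "C": {"a", "c", "e"}}


def pairwise_intersections(named_sets):
    """Lazily yield ((name1, name2), intersection) for every unordered pair."""
    for x, y in combinations(sorted(named_sets), 2):
        yield (x, y), named_sets[x] & named_sets[y]


# Materialize only if all results are needed at once; otherwise iterate lazily.
result = dict(pairwise_intersections(named))
print(result[("A", "B")])          # {'c'}
print(sorted(result[("A", "C")]))  # ['a', 'c']
print(sorted(result[("B", "C")]))  # ['c', 'e']
```

Writing it as a generator function rather than a generator expression means a fresh iterator is created on every call, which sidesteps the one-shot behaviour of a shared pairs iterator.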

answered Sep 28 '22 19:09

doug

If we can assume that the input sets are ordered, a pseudo-mergesort approach seems promising. Treat each set as a sorted stream and advance the streams in parallel, at each step advancing only those whose current value is the lowest among all the iterators. Every time an iterator is advanced, compare each stream's current value with the new minimum, and add any matches to your same-item collections.
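One way to sketch this idea is with heapq.merge from the standard library, which performs a lazy k-way merge of sorted inputs. Note that this swaps the hand-written iterator bookkeeping described above for the library merge plus itertools.groupby. The helper names (_tag, shared_items, pairwise_from_streams) are made up for the example, and each stream is assumed to yield its elements in ascending order without duplicates.

```python
import heapq
from collections import defaultdict
from itertools import combinations, groupby
from operator import itemgetter


def _tag(name, stream):
    """Pair every value from a sorted stream with the name of its set."""
    for value in stream:
        yield value, name


def shared_items(sorted_streams):
    """Yield (value, names) for each value that occurs in two or more streams.

    sorted_streams maps a set name to an iterable in ascending order
    (assumption: no duplicates within a single stream).
    """
    # heapq.merge keeps one pending element per stream, not whole sets.
    merged = heapq.merge(*(_tag(n, s) for n, s in sorted_streams.items()))
    # Equal values from different streams come out adjacent, so group them.
    for value, group in groupby(merged, key=itemgetter(0)):
        names = [name for _, name in group]
        if len(names) > 1:
            yield value, names


def pairwise_from_streams(sorted_streams):
    """Collect pairwise intersections from the shared items."""
    pairwise = defaultdict(set)
    for value, names in shared_items(sorted_streams):
        # names arrive in ascending order because ties are broken by name.
        for pair in combinations(names, 2):
            pairwise[pair].add(value)
    return dict(pairwise)


streams = {
    "A": iter(sorted({"a", "b", "c"})),
    "B": iter(sorted({"c", "d", "e"})),
    "C": iter(sorted({"a", "c", "e"})),
}
result = pairwise_from_streams(streams)
print(result[("A", "B")])  # {'c'}
```

In this sketch the streams could just as well be lines read from pre-sorted files, so only the current element of each stream and the (small) intersections need to be in memory at any time.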

answered Sep 28 '22 20:09

tzaman
