Let's say you have two sets: set1 is very large (a couple million values) and set2 is relatively small (a couple hundred thousand values). If you wanted to get the intersection of these two sets using the .intersection() method, would the order of the inputs affect the runtime?
For example, would one of these run faster than the other?
set1.intersection(set2)
set2.intersection(set1)
What's the time complexity of set intersection in Python? For a set with n elements intersected with a set of m elements, it is O(min(n, m)) on average, because each element of the smaller set is checked for membership in the larger set. The intersection of two sets A and B is the set of all elements that are common to both A and B.
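As a quick concrete illustration (the example sets here are made up for demonstration), intersecting two sets and counting the common elements looks like this:

```python
a = {10, 8, 6, 4}
b = {10, 8, 6, 1}

common = a.intersection(b)   # equivalently: a & b
print(sorted(common))        # the three elements common to both: 6, 8, 10
print(len(common))           # 3
```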
No, input order does not matter. In CPython (the standard Python implementation), set intersection is handled by the set_intersection function. When the other argument is also a set, the function swaps the two sets so that the smaller one is iterated while the larger set is used for (constant-time) lookups, as Booboo described:
if (PySet_GET_SIZE(other) > PySet_GET_SIZE(so)) {
    tmp = (PyObject *)so;
    so = (PySetObject *)other;
    other = tmp;
}

while (set_next((PySetObject *)other, &pos, &entry)) {
    key = entry->key;
    hash = entry->hash;
    rv = set_contains_entry(so, key, hash);
    if (rv < 0) {
        Py_DECREF(result);
        return NULL;
    }
    if (rv) {
        if (set_add_entry(result, key, hash)) {
            Py_DECREF(result);
            return NULL;
        }
    }
}
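The same swap-then-probe strategy can be sketched in pure Python (intersection_like_cpython is a hypothetical helper for illustration, not part of any library):

```python
def intersection_like_cpython(so, other):
    """Sketch of CPython's strategy: swap so that the smaller set is
    iterated, then probe the larger set with O(1) average-time
    membership tests, giving O(min(n, m)) overall on average."""
    if len(other) > len(so):
        so, other = other, so
    result = set()
    for key in other:        # iterate the smaller set
        if key in so:        # constant-time lookup in the larger set
            result.add(key)
    return result

print(intersection_like_cpython({1, 2, 3, 4, 5}, {4, 5, 6}))  # {4, 5}
```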
Thus, where set1 and set2 are sets, set1.intersection(set2) and set2.intersection(set1) will have the same performance. A small empirical test with timeit agrees:
import random
import string
import timeit
big_set = set()
while len(big_set) < 1000000:
    big_set.add(''.join(random.choices(string.ascii_letters, k=6)))

small_set = set()
while len(small_set) < 10000:
    small_set.add(''.join(random.choices(string.ascii_letters, k=6)))
print("Timing...")
print(f"big_set.intersection(small_set): {min(timeit.Timer(lambda: big_set.intersection(small_set)).repeat(31, 500))}")
print(f"small_set.intersection(big_set): {min(timeit.Timer(lambda: small_set.intersection(big_set)).repeat(31, 500))}")
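The & operator goes through the same intersection routine in CPython, so it is likewise order-independent. A small sketch (with made-up example sets):

```python
big = set(range(1000))
small = {3, 500, 2000}

# All four spellings produce the same result; argument order changes
# only which set gets iterated internally, not the outcome or the
# asymptotic cost.
assert big & small == small & big
assert big.intersection(small) == small.intersection(big)
print(big & small)  # the common elements: 3 and 500
```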