Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python3 Does input order matter for the .intersection() function in terms of runtime?

Tags:

Lets say you have two sets, set1 is very large (a couple million values), and set2 is relatively small (a couple hundred thousand values). If I wanted to get the intersection of values between these two sets using the .interstion() function, would there be a runtime improvement based on the order of the inputs?

For example would one of these run faster than the other?

set1.intersection(set2)
set2.intersection(set1)
like image 713
DJ Wolfson Avatar asked Jul 10 '20 17:07

DJ Wolfson


People also ask

What is the time complexity of intersection in Python?

What's the time complexity of set intersection in Python? The time complexity of set intersection in Python on a set with n elements and a set argument with m elements is O(min(n, m)) because you need to check for the smaller set whether each of its elements is a member of the larger set.

How does intersection work in Python?

Intersection() function PythonThe intersection of two given sets is the largest set, which contains all the elements that are common to both sets. The intersection of two given sets A and B is a set which consists of all the elements which are common to both A and B.

How do you find the intersection of two or more sets in Python?

Python count intersection of sets To count the intersection of sets in Python, we will use “len(set(set1) & set(set2))”. Here, ” & “ is an intersection element common to both. It will return the count as “3” because “10, 8, and 6” are common to both the sets.


1 Answers

No, input order does not matter. In CPython (the standard Python implementation), the set_intersection function handles set intersection. In the case where the other argument is also a set, the function will swap the two sets such that the smaller one is iterated through while the larger set is used for (constant time) lookup, as Booboo described:

        if (PySet_GET_SIZE(other) > PySet_GET_SIZE(so)) {
            tmp = (PyObject *)so;
            so = (PySetObject *)other;
            other = tmp;
        }

        while (set_next((PySetObject *)other, &pos, &entry)) {
            key = entry->key;
            hash = entry->hash;
            rv = set_contains_entry(so, key, hash);
            if (rv < 0) {
                Py_DECREF(result);
                return NULL;
            }
            if (rv) {
                if (set_add_entry(result, key, hash)) {
                    Py_DECREF(result);
                    return NULL;
                }
            }
        }

Thus, where set1 and set2 are sets, set1.intersect(set2) and set2.intersect(set1) will have the same performance. A small empirical test with timeit agrees:

import random
import string
import timeit

big_set = set()
while len(big_set) < 1000000:
    big_set.add(''.join(random.choices(string.ascii_letters, k=6)))


small_set = set()
while len(small_set) < 10000:
    small_set.add(''.join(random.choices(string.ascii_letters, k=6)))

print("Timing...")
print(f"big_set.intersection(small_set): {min(timeit.Timer(lambda: big_set.intersection(small_set)).repeat(31, 500))}")
print(f"small_set.intersection(big_set): {min(timeit.Timer(lambda: small_set.intersection(big_set)).repeat(31, 500))}")
like image 178
xavc Avatar answered Sep 16 '22 11:09

xavc