In python, is it faster to a) Build a set from a list of n items b) Insert n items into a set?
I found this page (http://wiki.python.org/moin/TimeComplexity) but it did not have enough information to conclude which was faster.
It seems, inserting items one at a time could in the worst case take O(n*n) time (given it uses dicts), and O(n*1) in the average case. Does initializing a set with a list offer any performance improvement?
In terms of O()
complexity - it's definitely the same, because both approaches do exactly the same - insert n
items into a set.
The difference comes from implementation: One clear advantage of initialization from an iterable is that you save a lot of Python-level function calls - the initialization from a iterable is done wholly on the C level (**).
Indeed, some tests on a list of 5,000,000 random integers shows that adding one by one is slower:
lst = [random.random() for i in xrange(5000000)]
set1 = set(lst) # takes 2.4 seconds
set2 = set() # takes 3.37 seconds
for item in lst:
set2.add(item)
(**) Looking inside the code of sets (Objects/setobject.c
), eventually item insertion boils down to a call to set_add_key
. When initializing from an iterable, this function is called in a tight C loop:
while ((key = PyIter_Next(it)) != NULL) {
if (set_add_key(so, key) == -1) {
Py_DECREF(it);
Py_DECREF(key);
return -1;
}
Py_DECREF(key);
}
On the other hand, each call to set.add
invokes attribute lookup, which resolves to the C set_add
function, which in turn calls set_add_key
. Since the item addition itself is relatively quick (the hash table implementation of Python is very efficient), these extra calls all build up.
$ python -V
Python 2.5.2
$ python -m timeit -s "l = range(1000)" "set(l)"
10000 loops, best of 3: 64.6 usec per loop
$ python -m timeit -s "l = range(1000)" "s = set()" "for i in l:s.add(i)"
1000 loops, best of 3: 307 usec per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With