I’m running filter on an iterable and want to store the result in a sequence (I need a sequence so that I can use random.choice on it). I noticed that creating a set from a filter object is a lot faster than creating a list or a tuple. Why is that? I first thought that the filter type is a subtype of set, which would explain this, but the filter function is actually equivalent to a generator expression, so it cannot really be a set internally.
I ran the following test to check the speed:
import time

def test(n, seq):
    for method in (set, list, tuple):
        t = time.time()
        for i in range(n):
            method(seq)
        print(method.__name__, (time.time() - t))

someFilter = filter(lambda x: x % 3 == 0, range(1000))
test(10000000, someFilter)
And the results clearly spoke in favor of using a set:
set 1.9240000247955322
list 8.82200002670288
tuple 7.031999826431274
So why is creating a set from a filter so much faster? Shouldn’t it usually take as long as creating a set from a sequence, where every element has to be hashed? Or is it somehow getting a boost from the internal filter representation?
For comparison, when running the test on a range expression, set takes about twice as long as list and tuple (which are both nearly identical in speed).
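For reference, a minimal sketch of that range comparison (my own reconstruction, not the original test code); testRange is a hypothetical helper mirroring the test function above, with the range built inside the loop so every pass sees a full, re-iterable input:

import time

def testRange(n, rangeSize):
    # Hypothetical helper mirroring test() above, but iterating over a range
    # object, which can be consumed repeatedly (unlike a filter iterator).
    for method in (set, list, tuple):
        t = time.time()
        for i in range(n):
            method(range(rangeSize))
        print(method.__name__, time.time() - t)

testRange(10000, 1000)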
Sven’s answer is totally right, but for completeness, here is an updated test that will run on an actual filter:
import time

def testFilter(n, test, rangeSize):
    for method in (set, list, tuple):
        t = time.time()
        for i in range(n):
            method(filter(test, range(rangeSize)))
        print(method.__name__, (time.time() - t))

testFilter(100000, lambda x: x % 3 == 0, 1000)
The result actually makes more sense, with list and tuple both being the fastest, although set isn’t really slower, so it won’t make much difference which one you use:
set 27.868000030517578
list 27.131999969482422
tuple 27.138000011444092
Generally, lists are faster than sets. But when searching for an element in a collection, sets are faster because they are implemented with hash tables, so Python does not have to scan the whole set; the average time complexity of a lookup is O(1).
Tuples are stored in a single block of memory, and they are immutable, so no extra space is needed to store new objects. Lists are allocated in two blocks: a fixed one with all the Python object information and a variable-sized block for the data. That is why creating a tuple is faster than creating a list. An element in a tuple cannot be removed or replaced.
Because sets cannot contain multiple occurrences of the same element, they are highly useful for efficiently removing duplicate values from a list or tuple and for performing common mathematical operations like unions and intersections.
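As a small illustration of those set properties (my own example, not part of the original answer):

nums = [1, 2, 2, 3, 3, 3]
unique = set(nums)          # duplicates removed: {1, 2, 3}
evens = {0, 2, 4, 6}
print(unique & evens)       # intersection: {2}
print(unique | evens)       # union: {0, 1, 2, 3, 4, 6}
print(3 in unique)          # membership test, O(1) on average: True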
filter() returns an iterator in Python 3, and this iterator will be consumed on the first run of the inner for-loop. After that, you are only measuring the speed of the constructor -- that's why you have to repeat it so often to make it consume at least a bit of time.
So it seems that the constructor of set() is the fastest one in dealing with an empty iterator.
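You can see the exhaustion directly (a quick illustration of the point, not part of the original answer):

f = filter(lambda x: x % 3 == 0, range(10))
print(list(f))  # [0, 3, 6, 9] -- the first conversion drains the iterator
print(list(f))  # []           -- every later conversion sees an empty iterator
print(set(f))   # set()        -- so the benchmark only times the constructors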
When timings suggest results that are illogical, often it is the timing suite itself that is at fault ;-)
Try using the timeit module which can help steer you clear of common timing mistakes. In particular, you want to run a fresh setup for each test and only time the loop body rather than the body plus the test code.
In this case, it would have at least made the timings comparable (all of them would have used a fresh iterator as returned by Python 3's version of filter) and it would have shown unbelievably fast timings (because only the method(iterator) code would have been timed in a loop).
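A rough sketch of what such a timeit-based measurement could look like (my own sketch, not the answerer's code): with number=1 and many repeats, each timed call gets a freshly built filter iterator from the setup.

from timeit import repeat

# Sketch: repeat() re-runs the setup before each timed run, so with number=1
# every constructor call receives a fresh (non-exhausted) filter iterator.
setup = "it = filter(lambda x: x % 3 == 0, range(1000))"
for ctor in ("set", "list", "tuple"):
    best = min(repeat(f"{ctor}(it)", setup=setup, number=1, repeat=1000))
    print(ctor, best)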
FWIW, pypy will be even harder to time because overly simple loops get optimized away completely.
[Answer to the question as edited] Your new timings are comparable (a good improvement), but the results still show the combination of the setup time and the loop time, making it difficult to see differences that might be significant. Your expectation should be that lists and tuples beat sets, because sets have to do more work (hashing each of the inputs rather than simply storing them).