Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is creating a set from a filter so much faster than creating a list or a tuple?

I’m running filter on an interable and want to store the result in a sequence (I need a sequence so that I can use random.choice on it). I noticed that creating a set from a filter object is a lot faster than creating a list or a tuple. Why is that? I first though that the filter type is a subtype of set, which would explain this, but the filter function is actually identical to a generator expression, so it cannot really be a set internally.

I ran the following test to check the speed:

import time

def test ( n, seq ):
    for method in ( set, list, tuple ):
        t = time.time()
        for i in range( n ):
            method( seq )
        print( method.__name__, ( time.time() - t ) )

someFilter = filter( lambda x: x % 3 == 0, range( 1000 ) )
test( 10000000, someFilter )

And the results were speaking clearly for using a set:

set 1.9240000247955322
list 8.82200002670288
tuple 7.031999826431274

So why is creating a set from a filter so much faster? Shouldn’t it usually take as long as creating a set from a sequence, where every element has to be hashed? Or is it somehow getting a boost from the internal filter representation?

For comparison, when running the test on a range expression, set takes about twice as long as list and tuple (which are both nearly identical in speed).

edit:

Sven’s answer is totally right, but for the completeness an updated test that will run on an actual filter:

import time

def testFilter ( n, test, rangeSize ):
    for method in ( set, list, tuple ):
        t = time.time()
        for i in range( n ):
            method( filter( test, range( rangeSize ) ) )
        print( method.__name__, ( time.time() - t ) )

testFilter( 100000, lambda x: x % 3 == 0, 1000 )

The result actually shows what makes more sense with list and tuple both being the fastest, although set isn’t really slower, so it won’t make any difference what to use:

set 27.868000030517578
list 27.131999969482422
tuple 27.138000011444092
like image 509
poke Avatar asked Nov 10 '11 23:11

poke


People also ask

Why is set so much faster than list?

Generally the lists are faster than sets. But in the case of searching for an element in a collection, sets are faster because sets have been implemented using hash tables. So basically Python does not have to search the full set, which means that the time complexity in average is O(1).

Why are lists slower than tuples?

Tuple is stored in a single block of memory. Creating a tuple is faster than creating a list. Creating a list is slower because two memory blocks need to be accessed. An element in a tuple cannot be removed or replaced.

Why is a set better than a list?

Because sets cannot have multiple occurrences of the same element, it makes sets highly useful to efficiently remove duplicate values from a list or tuple and to perform common math operations like unions and intersections.

Why tuple is more faster than list?

Tuples are stored in a single block of memory. Tuples are immutable so, It doesn't require extra space to store new objects. Lists are allocated in two blocks: the fixed one with all the Python object information and a variable-sized block for the data. It is the reason creating a tuple is faster than List.


2 Answers

filter() returns an iterator in Python 3, and this iterator will be consumed on the first run of the inner for-loop. After that, you are only measuring the speed of the contructor -- that's why you have to repeat it so often to make it consume at least a bit of time.

So it seems that the constructor of set() is the fastest one in dealing with an empty iterator.

like image 71
Sven Marnach Avatar answered Oct 19 '22 18:10

Sven Marnach


When timings suggest results that are illogical, often it is the timing suite itself that is at fault ;-)

Try using the timeit module which can help steer you clear of common timing mistakes. In particular, you want to run a fresh setup for each test and only time the loop body rather than the body plus the test code.

In this case, it would have at least made the timings comparable (all of them would have used a fresh iterator as returned by Python 3's version of *filter) and it would have shown unbelievably fast timings (because only the method(iterator) code would have been timed in a loop).

FWIW, pypy will be even harder to time because overly simple loops get optimized away completely.

[Answer to the question as edited] You new timings are comparable (a good improvement) but the results still show the combination of the setup time and the loop time making it difficult to see differences that might be significant. Your expectation should be that lists and tuples beat sets because sets have to do more work (hashing each of the inputs rather than simply storing them).

like image 23
Raymond Hettinger Avatar answered Oct 19 '22 16:10

Raymond Hettinger