I have to check presence of millions of elements (20-30 letters str) in the list containing 10-100k of those elements. Is there faster way of doing that in python than set()
?
import sys #load ids ids = set( x.strip() for x in open(idfile) ) for line in sys.stdin: id=line.strip() if id in ids: #print fastq print id #update ids ids.remove( id )
Generally the lists are faster than sets. But in the case of searching for an element in a collection, sets are faster because sets have been implemented using hash tables. So basically Python does not have to search the full set, which means that the time complexity in average is O(1).
set is as fast as it gets. However, if you rewrite your code to create the set once, and not change it, you can use the frozenset built-in type. It's exactly the same except immutable.
In Python 2, set is always the slowest. or is the fastest for 2 to 3 literals, and tuple and list are faster with 4 or more literals.
for testing membership of elements at the beginning of the collection, lists or tuples perform more than 100% better than sets 2.
set
is as fast as it gets.
However, if you rewrite your code to create the set
once, and not change it, you can use the frozenset
built-in type. It's exactly the same except immutable.
If you're still having speed problems, you need to speed your program up in other ways, such as by using PyPy instead of cPython.
As I noted in my comment, what's probably slowing you down is that you're sequentially checking each line from sys.stdin
for membership of your 'master' set. This is going to be really, really slow, and doesn't allow you to make use of the speed of set operations. As an example:
#!/usr/bin/env python import random # create two million-element sets of random numbers a = set(random.sample(xrange(10000000),1000000)) b = set(random.sample(xrange(10000000),1000000)) # a intersection b c = a & b # a difference c d = list(a - c) print "set d is all remaining elements in a not common to a intersection b" print "length of d is %s" % len(d)
The above runs in ~6 wallclock seconds on my five year-old machine, and it's testing for membership in larger sets than you require (unless I've misunderstood you). Most of that time is actually taken up creating the sets, so you won't even have that overhead. The fact that the strings you refer to are long isn't relevant here; creating a set creates a hash table, as agf explained. I suspect (though again, it's not clear from your question) that if you can get all your input data into a set before you do any membership testing, it'll be a lot faster, as opposed to reading it in one item at a time, then checking for set membership
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With