Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

faster membership testing in python than set()

Tags:

I have to check presence of millions of elements (20-30 letters str) in the list containing 10-100k of those elements. Is there faster way of doing that in python than set() ?

import sys #load ids ids = set( x.strip() for x in open(idfile) )  for line in sys.stdin:     id=line.strip()     if id in ids:         #print fastq         print id         #update ids         ids.remove( id ) 
like image 296
Leszek Avatar asked Aug 18 '11 15:08

Leszek


People also ask

What is faster than set in Python?

Generally the lists are faster than sets. But in the case of searching for an element in a collection, sets are faster because sets have been implemented using hash tables. So basically Python does not have to search the full set, which means that the time complexity in average is O(1).

Are set operations faster in Python?

set is as fast as it gets. However, if you rewrite your code to create the set once, and not change it, you can use the frozenset built-in type. It's exactly the same except immutable.

Are sets slow in Python?

In Python 2, set is always the slowest. or is the fastest for 2 to 3 literals, and tuple and list are faster with 4 or more literals.

Is set faster than tuple?

for testing membership of elements at the beginning of the collection, lists or tuples perform more than 100% better than sets 2.


2 Answers

set is as fast as it gets.

However, if you rewrite your code to create the set once, and not change it, you can use the frozenset built-in type. It's exactly the same except immutable.

If you're still having speed problems, you need to speed your program up in other ways, such as by using PyPy instead of cPython.

like image 92
agf Avatar answered Oct 22 '22 18:10

agf


As I noted in my comment, what's probably slowing you down is that you're sequentially checking each line from sys.stdin for membership of your 'master' set. This is going to be really, really slow, and doesn't allow you to make use of the speed of set operations. As an example:

#!/usr/bin/env python  import random  # create two million-element sets of random numbers a = set(random.sample(xrange(10000000),1000000)) b = set(random.sample(xrange(10000000),1000000)) # a intersection b c = a & b # a difference c d = list(a - c)  print "set d is all remaining elements in a not common to a intersection b" print "length of d is %s" % len(d) 

The above runs in ~6 wallclock seconds on my five year-old machine, and it's testing for membership in larger sets than you require (unless I've misunderstood you). Most of that time is actually taken up creating the sets, so you won't even have that overhead. The fact that the strings you refer to are long isn't relevant here; creating a set creates a hash table, as agf explained. I suspect (though again, it's not clear from your question) that if you can get all your input data into a set before you do any membership testing, it'll be a lot faster, as opposed to reading it in one item at a time, then checking for set membership

like image 20
urschrei Avatar answered Oct 22 '22 20:10

urschrei