I have to check presence of millions of elements (20-30 letters str) in the list containing 10-100k of those elements. Is there faster way of doing that in python than <code>set()</code> ? <pre class="prettyprint"><code>import sys #load ids ids = set( x.strip() for x in open(idfile) ) for line in sys.stdin: id=line.strip() if id in ids: #print fastq print id #update ids ids.remove( id ) </code></pre>

<code>set</code> is as fast as it gets. However, if you rewrite your code to create the <code>set</code> once, and not change it, you can use the <code>frozenset</code> built-in type. It's exactly the same except immutable. If you're still having speed problems, you need to speed your program up in other ways, such as by using PyPy instead of cPython.

As I noted in my comment, what's probably slowing you down is that you're sequentially checking each line from <code>sys.stdin</code> for membership of your 'master' set. This is going to be really, really slow, and doesn't allow you to make use of the speed of set operations. As an example: <pre class="prettyprint"><code>#!/usr/bin/env python import random # create two million-element sets of random numbers a = set(random.sample(xrange(10000000),1000000)) b = set(random.sample(xrange(10000000),1000000)) # a intersection b c = a & b # a difference c d = list(a - c) print "set d is all remaining elements in a not common to a intersection b" print "length of d is %s" % len(d) </code></pre> The above runs in ~6 wallclock seconds on my five year-old machine, and it's testing for membership in larger sets than you require (unless I've misunderstood you). Most of that time is actually taken up creating the sets, so you won't even have that overhead. The fact that the strings you refer to are long isn't relevant here; creating a set creates a hash table, as agf explained. I suspect (though again, it's not clear from your question) that if you can get all your input data into a set before you do any membership testing, it'll be a lot faster, as opposed to reading it in one item at a time, then checking for set membership

faster membership testing in python than set()

Tags:

I have to check presence of millions of elements (20-30 letters str) in the list containing 10-100k of those elements. Is there faster way of doing that in python than set() ?

import sys #load ids ids = set( x.strip() for x in open(idfile) )  for line in sys.stdin:     id=line.strip()     if id in ids:         #print fastq         print id         #update ids         ids.remove( id )

296

asked Aug 18 '11 15:08

Leszek

2 Answers

set is as fast as it gets.

However, if you rewrite your code to create the set once, and not change it, you can use the frozenset built-in type. It's exactly the same except immutable.

If you're still having speed problems, you need to speed your program up in other ways, such as by using PyPy instead of cPython.

answered Oct 22 '22 18:10

agf

As I noted in my comment, what's probably slowing you down is that you're sequentially checking each line from sys.stdin for membership of your 'master' set. This is going to be really, really slow, and doesn't allow you to make use of the speed of set operations. As an example:

#!/usr/bin/env python  import random  # create two million-element sets of random numbers a = set(random.sample(xrange(10000000),1000000)) b = set(random.sample(xrange(10000000),1000000)) # a intersection b c = a & b # a difference c d = list(a - c)  print "set d is all remaining elements in a not common to a intersection b" print "length of d is %s" % len(d)

The above runs in ~6 wallclock seconds on my five year-old machine, and it's testing for membership in larger sets than you require (unless I've misunderstood you). Most of that time is actually taken up creating the sets, so you won't even have that overhead. The fact that the strings you refer to are long isn't relevant here; creating a set creates a hash table, as agf explained. I suspect (though again, it's not clear from your question) that if you can get all your input data into a set before you do any membership testing, it'll be a lot faster, as opposed to reading it in one item at a time, then checking for set membership

answered Oct 22 '22 20:10

urschrei

Related questions
                            
                                What happens when I assign a negative value to an unsigned int? [duplicate]
                            
                                What is the proxy meaning in EntityFramework?
                            
                                How to implement different levels for specific modules in Python
                            
                                Nesting with three levels of quotations
                            
                                How would I do this iOS animation on OSX?
                            
                                Folding, function composition, monads, and laziness, oh my?
                            
                                How to set session attribute in java? [duplicate]
                            
                                has_many :through with counter_cache
                            
                                How to enable horizontal scrollbar to pop when ‘nowrap’ is set in Vim?
                            
                                How do I get the difference between two dates under bash [duplicate]
                            
                                App_Offline in MSBuild Remote Web Deploy
                            
                                Why does python use both reference counting and mark-and-sweep for gc?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With