I need to read gigabytes of text so I'm trying to optimize my code. When doing this I found that, for my problem, using a dictionary is faster than if-tests.
check = {'R':'-', 'F':'+'}
seqs = ['R', 'F']*100
def check1():
for entry in seqs:
if entry == 'R':
strand = '-'
if entry == 'F':
strand = '+'
def check2():
for entry in seqs:
strand = check[entry]
Using ipythong's %timeit I see that looking up in a dictionary is slightly more than twice as fast as using two if-tests:
In [63]: %timeit check1()
10000 loops, best of 3: 38.8 us per loop
In [64]: %timeit check2()
100000 loops, best of 3: 16.2 us per loop
Since if-tests are so basic, I did not expect a performance difference. Is this well known? Can anybody explain why this is so?
UPDATE
I checked how the two functions above as well as check3() below affect the runtime of my actual code, and there's no effect on the total time. So either the boost gotten with the dictionary is not so high in a real-world example where the 'R' and 'F' values need to be re-read from file constantly, or this piece of code is just not part of my bottleneck.
Anyway thanks for the answers!
You haven't actually proved that looking up in a dictionary is faster than two if
tests. What you have shown is that looking up in that particular dictionary is faster than those two tests.
Normally dictionary lookup needs a few steps: generate a hash from the key to find a potential match, then test the potential match by comparing the keys. Sometimes it may have to do multiple comparisons if there are hash table collisions. If you have user defined classes for the keys then both of these steps could be slow, they are generally fast for strings but in one particular case they are really really fast, and you've hit that case.
Your dictionary uses keys that are short strings that match the format for identifiers known at compile time. Python will helpfully 'intern' your strings 'R' and 'F'. Since the strings you use in your tests are also known at compile time they will be the exact same instances. What all this means for dictionary lookup is that a specialised version of the lookup is used for a dictionary which has only string keys, the hash has always been precomputed, and key comparison is done by comparing addresses (at least when it succeeds and with your two keys it will never fail).
Your real code will, I assume be reading strings from the input, so it won't have the interned copy of 'R'. That means it will need to compute the hash for each line of input. The addresses won't match so it will have to call the string comparison function for each test. You do still get some optimisations for having only string keys and at least it doesn't have to do the general purpose comparison for objects which might not be strings.
The if
statements know nothing about the object types so they do do a general purpose comparison every time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With