Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing the speed of startswith() .vs. in() [duplicate]

I was under the impression that startswith has to be faster than in for the simple reason that in has to do more checks (allows for the word being looked for to be anywhere in the string). But I had my doubts so I decided to timeit. The code for the timings is given below and as you will probably notice I haven't done much timing; the code is rather simple.

import timeit

setup1='''
def in_test(sent, word):
    if word in sent:
        return True
    else:
        return False
'''

setup2='''
def startswith_test(sent, word):
    if sent.startswith(word):
        return True
    else:
        return False
'''

print(timeit.timeit('in_test("this is a standard sentence", "this")', setup=setup1))
print(timeit.timeit('startswith_test("this is a standard sentence", "this")', setup=setup2))

Results:

>> in:         0.11912814951705597
>> startswith: 0.22812353561129417

So startswith is twice as slow!.. I find this behavior very puzzling given what I said further above. Am I doing something wrong with timing the two or is in indeed faster? If so, why?

Note that the results are very similar even when they both return False (in this case, in would have to actually traverse the whole sentece in case it simply short-circuited before):

print(timeit.timeit('in_test("another standard sentence, would be that", "this")', setup=setup1))
print(timeit.timeit('startswith_test("another standard sentence, would be that", "this")', setup=setup2))

>> in:         0.12854891578786237
>> startswith: 0.2233201940338861

If I had to implement the two functions from scratch it would look something like this (pseudocode):

startswith: start comparing the letters of word to the letters of sentence one by one until a) word gets depleted (return True) or b) check returns False (return False)

in: call startswith for every position where the initial letter of word can be found in sentence.

I just don't get it..


Just to make it clear, in and startswith are not equivallent; I am just talking about the case where the word one is trying to find has to be the first in a string.

like image 531
Ma0 Avatar asked Jun 13 '17 11:06

Ma0


Video Answer


2 Answers

You're comparing an operator on strings -vs- an attribute lookup and a function call. The second one will have a higher overhead, even if the first one takes a long time on a lot of data.

Additionally you're looking for the first word, so if it does match, in will look at just as much data as startswith(). To see the difference you should look at a pessimistic case (no results found, or match at the end of the string):

setup1='''
data = "xxxx"*1000
def ....

print(timeit.timeit('in_test(data, "this")', setup=setup1))
0.932795189000899
print(timeit.timeit('startswith_test(data, "this")', setup=setup2))
0.22242475600069156
like image 132
viraptor Avatar answered Oct 04 '22 13:10

viraptor


This is due to the fact that you have to look-up and invoke a method. in is specialized and leads directly to COMPARE_OP (calling cmp_outcome which, in turn, calls PySequence_Contains) while str.startswith goes through slower byte-code:

2 LOAD_ATTR                0 (startswith)
4 LOAD_FAST                1 (word)
6 CALL_FUNCTION            1              # the slow part

Replacing in with __contains__, forcing a function call for that case too, pretty much negates the speed difference:

setup1='''
def in_test(sent, word):
    if sent.__contains__(word):
        return True
    else:
        return False
'''

And, the timings:

print(timeit.timeit('in_test("this is a standard sentence", "this")', setup=setup1))
print(timeit.timeit('startswith_test("this is a standard sentence", "this")', setup=setup2))
0.43849368393421173
0.4993997460696846

in is winning here because of the fact that it doesn't need to go through the whole function call setup and due to the favorable case it's presented with.

like image 31
Dimitris Fasarakis Hilliard Avatar answered Oct 04 '22 12:10

Dimitris Fasarakis Hilliard