I was under the impression that startswith
has to be faster than in
for the simple reason that in
has to do more checks (allows for the word being looked for to be anywhere in the string). But I had my doubts so I decided to timeit
. The code for the timings is given below and as you will probably notice I haven't done much timing; the code is rather simple.
import timeit
setup1='''
def in_test(sent, word):
if word in sent:
return True
else:
return False
'''
setup2='''
def startswith_test(sent, word):
if sent.startswith(word):
return True
else:
return False
'''
print(timeit.timeit('in_test("this is a standard sentence", "this")', setup=setup1))
print(timeit.timeit('startswith_test("this is a standard sentence", "this")', setup=setup2))
Results:
>> in: 0.11912814951705597
>> startswith: 0.22812353561129417
So startswith
is twice as slow!.. I find this behavior very puzzling given what I said further above. Am I doing something wrong with timing the two or is in
indeed faster? If so, why?
Note that the results are very similar even when they both return False
(in this case, in
would have to actually traverse the whole sentece in case it simply short-circuited before):
print(timeit.timeit('in_test("another standard sentence, would be that", "this")', setup=setup1))
print(timeit.timeit('startswith_test("another standard sentence, would be that", "this")', setup=setup2))
>> in: 0.12854891578786237
>> startswith: 0.2233201940338861
If I had to implement the two functions from scratch it would look something like this (pseudocode):
startswith
: start comparing the letters of word to the letters of sentence one by one until a) word gets depleted (return True) or b) check returns False (return False)
in
: call startswith
for every position where the initial letter of word can be found in sentence.
I just don't get it..
Just to make it clear, in
and startswith
are not equivallent; I am just talking about the case where the word one is trying to find has to be the first in a string.
You're comparing an operator on strings -vs- an attribute lookup and a function call. The second one will have a higher overhead, even if the first one takes a long time on a lot of data.
Additionally you're looking for the first word, so if it does match, in
will look at just as much data as startswith()
. To see the difference you should look at a pessimistic case (no results found, or match at the end of the string):
setup1='''
data = "xxxx"*1000
def ....
print(timeit.timeit('in_test(data, "this")', setup=setup1))
0.932795189000899
print(timeit.timeit('startswith_test(data, "this")', setup=setup2))
0.22242475600069156
This is due to the fact that you have to look-up and invoke a method. in
is specialized and leads directly to COMPARE_OP
(calling cmp_outcome
which, in turn, calls PySequence_Contains
) while str.startswith
goes through slower byte-code:
2 LOAD_ATTR 0 (startswith)
4 LOAD_FAST 1 (word)
6 CALL_FUNCTION 1 # the slow part
Replacing in
with __contains__
, forcing a function call for that case too, pretty much negates the speed difference:
setup1='''
def in_test(sent, word):
if sent.__contains__(word):
return True
else:
return False
'''
And, the timings:
print(timeit.timeit('in_test("this is a standard sentence", "this")', setup=setup1))
print(timeit.timeit('startswith_test("this is a standard sentence", "this")', setup=setup2))
0.43849368393421173
0.4993997460696846
in
is winning here because of the fact that it doesn't need to go through the whole function call setup and due to the favorable case it's presented with.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With