Why is the implementation of startwith
slower than slicing?
In [1]: x = 'foobar'
In [2]: y = 'foo'
In [3]: %timeit x.startswith(y)
1000000 loops, best of 3: 321 ns per loop
In [4]: %timeit x[:3] == y
10000000 loops, best of 3: 164 ns per loop
Surprisingly, even including calculation for the length, slicing still appears significantly faster:
In [5]: %timeit x[:len(y)] == y
1000000 loops, best of 3: 251 ns per loop
Note: the first part of this behaviour is noted in Python for Data Analysis (Chapter 3), but no explanation for it is offered.
.
If helpful: here is the C code for startswith
; and here is the output of dis.dis
:
In [6]: import dis
In [7]: dis_it = lambda x: dis.dis(compile(x, '<none>', 'eval'))
In [8]: dis_it('x[:3]==y')
1 0 LOAD_NAME 0 (x)
3 LOAD_CONST 0 (3)
6 SLICE+2
7 LOAD_NAME 1 (y)
10 COMPARE_OP 2 (==)
13 RETURN_VALUE
In [9]: dis_it('x.startswith(y)')
1 0 LOAD_NAME 0 (x)
3 LOAD_ATTR 1 (startswith)
6 LOAD_NAME 2 (y)
9 CALL_FUNCTION 1
12 RETURN_VALUE
STARTSWITH is a string manipulation function that manipulates all string data types (BIT, BLOB, and CHARACTER), and returns a Boolean value to indicate whether one string begins with another.
The startswith() method returns True if the string starts with the specified value, otherwise False.
The startswith() search is case-sensitive, as shown below. The start and end parameters limit the checking of a prefix in a string as indexes.
There are two built-in methods in Python to do the task. These are startswith() and endswith() methods. If any string starts with a given prefix then startswith() method will return true otherwise returns false and if any string ending with a given suffix then endswith() method will return true otherwise returns false.
Some of the performance difference can be explained by taking into account the time it takes the .
operator to do its thing:
>>> x = 'foobar'
>>> y = 'foo'
>>> sw = x.startswith
>>> %timeit x.startswith(y)
1000000 loops, best of 3: 316 ns per loop
>>> %timeit sw(y)
1000000 loops, best of 3: 267 ns per loop
>>> %timeit x[:3] == y
10000000 loops, best of 3: 151 ns per loop
Another portion of the difference can be explained by the fact that startswith
is a function, and even no-op function calls take a bit of time:
>>> def f():
... pass
...
>>> %timeit f()
10000000 loops, best of 3: 105 ns per loop
This does not totally explain the difference, since the version using slicing and len
calls a function and is still faster (compare to sw(y)
above -- 267 ns):
>>> %timeit x[:len(y)] == y
1000000 loops, best of 3: 213 ns per loop
My only guess here is that maybe Python optimizes lookup time for built-in functions, or that len
calls are heavily optimized (which is probably true). It might be possible to test that with a custom len
func. Or possibly this is where the differences identified by LastCoder kick in. Note also larsmans' results, which indicate that startswith
is actually faster for longer strings. The whole line of reasoning above applies only to those cases where the overhead I'm talking about actually matters.
The comparison isn't fair since you're only measuring the case where startswith
returns True
.
>>> x = 'foobar'
>>> y = 'fool'
>>> %timeit x.startswith(y)
1000000 loops, best of 3: 221 ns per loop
>>> %timeit x[:3] == y # note: length mismatch
10000000 loops, best of 3: 122 ns per loop
>>> %timeit x[:4] == y
10000000 loops, best of 3: 158 ns per loop
>>> %timeit x[:len(y)] == y
1000000 loops, best of 3: 210 ns per loop
>>> sw = x.startswith
>>> %timeit sw(y)
10000000 loops, best of 3: 176 ns per loop
Also, for much longer strings, startswith
is a lot faster:
>>> import random
>>> import string
>>> x = '%030x' % random.randrange(256**10000)
>>> len(x)
20000
>>> y = r[:4000]
>>> %timeit x.startswith(y)
1000000 loops, best of 3: 211 ns per loop
>>> %timeit x[:len(y)] == y
1000000 loops, best of 3: 469 ns per loop
>>> sw = x.startswith
>>> %timeit sw(y)
10000000 loops, best of 3: 168 ns per loop
This is still true when there's no match.
# change last character of y
>>> y = y[:-1] + chr((ord(y[-1]) + 1) % 256)
>>> %timeit x.startswith(y)
1000000 loops, best of 3: 210 ns per loop
>>> %timeit x[:len(y)] == y
1000000 loops, best of 3: 470 ns per loop
>>> %timeit sw(y)
10000000 loops, best of 3: 168 ns per loop
# change first character of y
>>> y = chr((ord(y[0]) + 1) % 256) + y[1:]
>>> %timeit x.startswith(y)
1000000 loops, best of 3: 210 ns per loop
>>> %timeit x[:len(y)] == y
1000000 loops, best of 3: 442 ns per loop
>>> %timeit sw(y)
10000000 loops, best of 3: 168 ns per loop
So, startswith
is probably slower for short strings because it's optimized for long ones.
(Trick to get random strings taken from this answer.)
startswith
is more complex than slicing...
2924 result = _string_tailmatch(self,
2925 PyTuple_GET_ITEM(subobj, i),
2926 start, end, -1);
This isn't a simple character compare loop for needle in beginning of haystack that's happening. We're looking at a for loop that is iterating through a vector/tuple (subobj) and calling another function (_string_tailmatch
) on it. Multiple function calls have overhead with regards to the stack, argument sanity checks etc...
startswith
is a library function while the slicing appears to be built into the language.
2919 if (!stringlib_parse_args_finds("startswith", args, &subobj, &start, &end))
2920 return NULL;
To quote the docs, startswith
does more you might think:
str.startswith(prefix[, start[, end]])
Return
True
if string starts with the prefix, otherwise returnFalse
. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With