Stripping white-space can be done in two ways with str.strip. You can either issue a call with no arguments, str.strip(), which defaults to using a white-space delimiter, or explicitly supply the argument yourself with str.strip(' ').
But, why is it that when timed these functions perform so differently?
Using a sample string with an intentional amount of white spaces:
s = " " * 100 + 'a' + " " * 100
The timings for s.strip() and s.strip(' ') are respectively:
    %timeit s.strip()
    The slowest run took 32.74 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000000 loops, best of 3: 396 ns per loop

    %timeit s.strip(' ')
    100000 loops, best of 3: 4.5 µs per loop
strip takes 396 ns while strip(' ') takes 4.5 µs; a similar scenario is present with rstrip and lstrip under the same conditions. bytes objects seem to be affected as well.
The timings were performed on Python 3.5.2; on Python 2.7.1 the difference is less drastic. The docs on str.strip don't indicate anything useful, so why does this happen?
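To reproduce the comparison outside IPython, the measurement can be sketched with the standard timeit module. The helper name best_time and the loop counts are assumptions chosen for illustration, and the absolute numbers will differ by machine and Python version.

```python
import timeit

# Same sample string as in the question: 100 spaces, one letter, 100 spaces.
s = " " * 100 + "a" + " " * 100


def best_time(stmt, number=20_000, repeat=3):
    """Return the best observed time per call, in seconds, for stmt."""
    timings = timeit.repeat(stmt, globals={"s": s}, number=number, repeat=repeat)
    return min(timings) / number


# Compare the no-argument and explicit-argument forms of all three methods.
for stmt in ("s.strip()", "s.strip(' ')",
             "s.lstrip()", "s.lstrip(' ')",
             "s.rstrip()", "s.rstrip(' ')"):
    print(f"{stmt:15} {best_time(stmt) * 1e9:10.1f} ns per call")
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes.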
strip() is a built-in string method in Python. It removes white-space or specific characters from the beginning and end of a string, and it accepts a single optional parameter: a string specifying the set of characters to be removed. Note: if the chars argument is not provided, all leading and trailing white-space is removed from the string.
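A few calls illustrate these semantics. Note in particular that the argument is treated as a set of characters rather than as a prefix or suffix, and that strip(' ') only removes spaces, not every kind of white-space.

```python
# No argument: all leading and trailing white-space is removed.
print(repr("  hello  ".strip()))               # 'hello'

# One character supplied: only that character is removed from both ends.
print(repr("xxhelloxx".strip("x")))            # 'hello'

# Several characters supplied: they form a set, order does not matter.
print(repr("abcHELLOcba".strip("abc")))        # 'HELLO'

# strip(' ') is not the same as strip(): tabs and newlines are kept.
print(repr("\t hello \n".strip(" ")))          # '\t hello \n'
print(repr("\t hello \n".strip()))             # 'hello'
```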
When an argument is supplied, the loop is the same as in the no-argument case, but it has the extra memchr(sep, Py_CHARMASK(s[j]), seplen) check for every character examined. So the time complexity becomes O(N * M), where N is the number of characters examined and M is the length of the string of characters to be stripped.
This is because two functions exist for the two different cases, as can be seen in unicode_strip: do_strip and _PyUnicode_XStrip, the first executing much faster than the second.
do_strip handles the common case str.strip(), where no arguments exist, and do_argstrip (which wraps _PyUnicode_XStrip) handles the case where str.strip(arg) is called, i.e. arguments are provided.
do_argstrip just checks the separator: if it is valid and not equal to None it calls _PyUnicode_XStrip; if it is None it calls do_strip.
Both do_strip and _PyUnicode_XStrip follow the same logic: two counters are used, one equal to zero and the other equal to the length of the string.
Using two while loops, the first counter is incremented until a value not equal to the separator is reached and the second counter is decremented until the same condition is met.
The difference lies in how the check that the current character is not equal to the separator is performed.
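The shared two-counter logic, together with the dispatch on whether an argument was given, can be modelled in Python. This is only a sketch of the idea, not the C implementation; the names strip_with and strip_model are hypothetical.

```python
def strip_with(s, is_strippable):
    """Two-counter strip: i moves right from 0, j moves left from len(s)."""
    i = 0
    j = len(s)
    # First loop: advance i past every leading strippable character.
    while i < j and is_strippable(s[i]):
        i += 1
    # Second loop: move j back past every trailing strippable character.
    while j > i and is_strippable(s[j - 1]):
        j -= 1
    return s[i:j]


def strip_model(s, chars=None):
    """Dispatch the way the text describes: None means 'white-space'."""
    if chars is None:
        # Common case: a cheap white-space test per character.
        return strip_with(s, str.isspace)
    # Argument case: each character is searched for in the supplied set.
    return strip_with(s, lambda ch: ch in chars)


print(repr(strip_model("   a   ")))        # 'a'
print(repr(strip_model("xxaxx", "x")))     # 'a'
```

Only the predicate passed to strip_with changes between the two cases, which is where the cost difference comes from.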
do_strip:
In the most common case, where the characters in the string to be stripped can be represented in ASCII, an additional small performance boost is present.
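    while (i < len) {
        Py_UCS1 ch = data[i];
        if (!_Py_ascii_whitespace[ch])
            break;
        i++;
    }
The current character is read directly from the underlying array with Py_UCS1 ch = data[i]; and the white-space check is a simple index into an array, _Py_ascii_whitespace[ch]. So, in short, it is quite efficient.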
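The table-lookup idea can be imitated in Python as follows. The table here is built from str.isspace rather than copied from CPython, and the names ASCII_WHITESPACE and strip_ascii_model are hypothetical; the point is only that each check is a single index into a precomputed array.

```python
# Precomputed 128-entry table: True where the ASCII code is white-space.
# Built from str.isspace for illustration; CPython ships its own table.
ASCII_WHITESPACE = [chr(code).isspace() for code in range(128)]


def strip_ascii_model(text):
    """Strip white-space from an ASCII-only string using table lookups."""
    data = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    i = 0
    j = len(data)
    # Indexing bytes yields an int, which is used directly as the table index.
    while i < j and ASCII_WHITESPACE[data[i]]:
        i += 1
    while j > i and ASCII_WHITESPACE[data[j - 1]]:
        j -= 1
    return data[i:j].decode("ascii")


print(repr(strip_ascii_model(" \t a b \n")))   # 'a b'
```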
If the characters are not in the ASCII range, the differences aren't that drastic but they do slow the overall execution down:
    while (i < len) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        if (!Py_UNICODE_ISSPACE(ch))
            break;
        i++;
    }
Here the character is read with Py_UCS4 ch = PyUnicode_READ(kind, data, i); and the white-space check is done by the Py_UNICODE_ISSPACE(ch) macro (which simply calls another macro: Py_ISSPACE).
_PyUnicode_XStrip:
For this case, accessing the underlying data is, as it was in the previous case, done with PyUnicode_READ; the check to see if the character is a white-space (or really, any character we've provided), on the other hand, is a bit more complex.
    while (i < len) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        if (!BLOOM(sepmask, ch))
            break;
        if (PyUnicode_FindChar(sepobj, ch, 0, seplen, 1) < 0)
            break;
        i++;
    }
PyUnicode_FindChar is used, which, although efficient, is much more complex and slower than an array access. It is called for each character in the string to see if that character is contained in the separator(s) we've provided. As the length of the string increases, so does the overhead introduced by calling this function continuously.
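The two-step check in that loop (a cheap bit-mask filter followed by a real search) can be sketched in Python. The mask width of 64 and the names make_bloom_mask and xstrip_model are assumptions made for illustration; str.find stands in for PyUnicode_FindChar.

```python
BLOOM_WIDTH = 64  # assumed width; the real value is a CPython build detail


def make_bloom_mask(chars):
    """Set one bit per separator character, keyed on its low-order bits."""
    mask = 0
    for ch in chars:
        mask |= 1 << (ord(ch) % BLOOM_WIDTH)
    return mask


def xstrip_model(s, chars):
    """Strip any character found in chars, using filter-then-search."""
    mask = make_bloom_mask(chars)

    def strippable(ch):
        # Step 1: the mask can say "definitely not a separator" cheaply.
        if not mask & (1 << (ord(ch) % BLOOM_WIDTH)):
            return False
        # Step 2: the mask can give false positives, so do a real search.
        return chars.find(ch) >= 0

    i, j = 0, len(s)
    while i < j and strippable(s[i]):
        i += 1
    while j > i and strippable(s[j - 1]):
        j -= 1
    return s[i:j]


print(repr(xstrip_model(" " * 100 + "a" + " " * 100, " ")))   # 'a'
# '`' (code 96) shares a mask bit with ' ' (code 32) but is not stripped.
print(repr(xstrip_model("` a `", " ")))                       # '` a `'
```

With the sample string from the question every space passes the filter, so the search runs for all 200 stripped characters, which matches the overhead described above.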
For those interested, PyUnicode_FindChar, after quite some checks, will eventually call find_char inside stringlib, which, in the case where the length of the separators is < 10, will loop until it finds the character.
Apart from this, consider the additional function calls that must already have been made in order to get here.
As for lstrip and rstrip, the situation is similar. Flags for which mode of stripping to perform exist, namely: RIGHTSTRIP for rstrip, LEFTSTRIP for lstrip and BOTHSTRIP for strip. The logic inside do_strip and _PyUnicode_XStrip is performed conditionally based on the flag.
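A small Python model shows how a single routine can serve all three methods by running each loop conditionally. The numeric flag values and the name strip_by_mode are assumptions for illustration only.

```python
# Flag names follow the text; the numeric values are assumed.
LEFTSTRIP, RIGHTSTRIP, BOTHSTRIP = 0, 1, 2


def strip_by_mode(s, striptype, chars=None):
    """Strip from the left, the right, or both ends depending on striptype."""
    if chars is None:
        strippable = str.isspace
    else:
        def strippable(ch):
            return ch in chars

    i = 0
    j = len(s)
    # The left-hand loop is skipped entirely for rstrip.
    if striptype != RIGHTSTRIP:
        while i < j and strippable(s[i]):
            i += 1
    # The right-hand loop is skipped entirely for lstrip.
    if striptype != LEFTSTRIP:
        while j > i and strippable(s[j - 1]):
            j -= 1
    return s[i:j]


print(repr(strip_by_mode("  a  ", LEFTSTRIP)))    # 'a  '
print(repr(strip_by_mode("  a  ", RIGHTSTRIP)))   # '  a'
print(repr(strip_by_mode("  a  ", BOTHSTRIP)))    # 'a'
```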