
Why is str.strip() so much faster than str.strip(' ')?

Stripping white-space can be done in two ways with str.strip. You can either call it with no arguments, str.strip(), which defaults to stripping white-space, or explicitly supply the argument yourself with str.strip(' ').

But why is it that, when timed, these two calls perform so differently?

Using a sample string with an intentional amount of white spaces:

s = " " * 100 + 'a' + " " * 100 

The timings for s.strip() and s.strip(' ') are respectively:

    %timeit s.strip()
    The slowest run took 32.74 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000000 loops, best of 3: 396 ns per loop

    %timeit s.strip(' ')
    100000 loops, best of 3: 4.5 µs per loop

strip takes 396 ns while strip(' ') takes 4.5 µs. A similar scenario is present with rstrip and lstrip under the same conditions. Also, bytes objects seem to be affected too.
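For anyone without IPython at hand, the comparison can be roughly reproduced with the standard timeit module. This is only a sketch: the helper name time_call and the loop count are my own choices, the lambda wrapper adds the same small overhead to both measurements, and the absolute numbers will vary by machine and Python version.

```python
import timeit

# Same sample string as in the question: 100 spaces, 'a', 100 spaces.
s = " " * 100 + "a" + " " * 100


def time_call(func, number=20_000):
    """Return the average duration of one call to func, in nanoseconds.

    Hypothetical helper; `number` is an arbitrary loop count.
    """
    total_seconds = timeit.timeit(func, number=number)
    return total_seconds / number * 1e9


no_arg_ns = time_call(lambda: s.strip())
with_arg_ns = time_call(lambda: s.strip(" "))

print(f"s.strip():    {no_arg_ns:.0f} ns per call")
print(f"s.strip(' '): {with_arg_ns:.0f} ns per call")
```

Both calls return the same result for this particular string, so any gap in the printed numbers comes from how the stripping is done, not from what is stripped.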

The timings were performed on Python 3.5.2; on Python 2.7.1 the difference is less drastic. The docs on str.strip don't indicate anything useful, so why does this happen?

Dimitris Fasarakis Hilliard asked Jul 09 '16 19:07





1 Answer

In a tl;dr fashion:

This is because two functions exist for the two different cases, as can be seen in unicode_strip: do_strip and _PyUnicode_XStrip, the first executing much faster than the second.

do_strip handles the common case str.strip(), where no arguments are passed, while do_argstrip (which wraps _PyUnicode_XStrip) handles str.strip(arg), i.e. the case where an argument is provided.


do_argstrip just checks the separator: if it is None, it calls do_strip; if it is a valid string, it calls _PyUnicode_XStrip.

Both do_strip and _PyUnicode_XStrip follow the same logic: two counters are used, one starting at zero and the other at the length of the string.

Using two while loops, the first counter is incremented until a character that is not a separator is reached, and the second counter is decremented until the same condition is met.
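The two-counter logic can be sketched in Python as follows. This is an illustration of the idea only, not a translation of the C source; the name strip_with and the should_strip callback are hypothetical.

```python
def strip_with(s, should_strip):
    """Strip from both ends every character for which should_strip(ch) is true."""
    i = 0        # first counter: starts at zero
    j = len(s)   # second counter: starts at the length of the string

    # Move i forward past leading characters that should be stripped.
    while i < j and should_strip(s[i]):
        i += 1

    # Move j backward past trailing characters that should be stripped.
    while j > i and should_strip(s[j - 1]):
        j -= 1

    return s[i:j]


print(repr(strip_with("  a  ", str.isspace)))
print(repr(strip_with("xxaxx", lambda ch: ch in "x")))
```

The only thing that changes between the fast and the slow path is what should_strip does for each character, which is where the rest of this answer focuses.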

The difference lies in how each function checks whether the current character is a separator.

For do_strip:

In the most common case, where the characters in the string to be stripped can be represented in ASCII, an additional small performance boost is present.

    while (i < len) {
        Py_UCS1 ch = data[i];
        if (!_Py_ascii_whitespace[ch])
            break;
        i++;
    }

  • Accessing the current character in the data is done quickly by indexing the underlying array: Py_UCS1 ch = data[i];
  • The check if a character is a white-space is made by a simple array index into an array called _Py_ascii_whitespace[ch].
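To make the table-lookup idea concrete, here is a small Python sketch. ASCII_WHITESPACE is a hypothetical stand-in for _Py_ascii_whitespace that I build from str.isspace, and leading_whitespace_count is an illustrative name; the input is assumed to be pure ASCII, otherwise the index would fall outside the table.

```python
# Hypothetical stand-in for _Py_ascii_whitespace: one flag per ASCII code point.
ASCII_WHITESPACE = [chr(code).isspace() for code in range(128)]


def leading_whitespace_count(data: bytes) -> int:
    """Count leading whitespace bytes in ASCII-only data using table lookups."""
    i = 0
    while i < len(data):
        ch = data[i]                  # plain index into the underlying bytes
        if not ASCII_WHITESPACE[ch]:  # one array access decides the check
            break
        i += 1
    return i


print(leading_whitespace_count("  \t\nabc ".encode("ascii")))
```

Each iteration costs two array accesses and nothing else, which is why this path is hard to beat.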

So, in short, it is quite efficient.

If the characters are not in the ASCII range, the differences aren't that drastic, but they do slow the overall execution down:

    while (i < len) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        if (!Py_UNICODE_ISSPACE(ch))
            break;
        i++;
    }

  • Accessing is done with Py_UCS4 ch = PyUnicode_READ(kind, data, i);
  • Checking if the character is whitespace is done by the Py_UNICODE_ISSPACE(ch) macro (which uses the _Py_ascii_whitespace table for code points below 128 and falls back to _PyUnicode_IsWhitespace otherwise)

For _PyUnicode_XStrip:

For this case, accessing the underlying data is done with PyUnicode_READ, as in the previous case. The check to see whether the character is white-space (or really, any character we've provided) is, on the other hand, a bit more complex.

    while (i < len) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        if (!BLOOM(sepmask, ch))
            break;
        if (PyUnicode_FindChar(sepobj, ch, 0, seplen, 1) < 0)
            break;
        i++;
    }

PyUnicode_FindChar is used, which, although efficient, is much more complex and slower than an array access. For each character in the string it is called to see if that character is contained in the separator(s) we've provided. As the length of the string increases, so does the overhead introduced by calling this function continuously.
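The shape of that loop, a cheap bit-mask pre-check followed by a real search in the separators, can be sketched in Python. Everything here is illustrative: the 64-bit mask width is an assumption, make_bloom_mask, bloom_hit and lstrip_count are hypothetical names, and str.find stands in for PyUnicode_FindChar.

```python
BLOOM_WIDTH = 64  # assumed mask width, chosen for illustration


def make_bloom_mask(chars: str) -> int:
    """Set one bit per separator character, keyed on its low-order bits."""
    mask = 0
    for ch in chars:
        mask |= 1 << (ord(ch) & (BLOOM_WIDTH - 1))
    return mask


def bloom_hit(mask: int, ch: str) -> bool:
    """True if ch *may* be a separator; False means it definitely is not."""
    return bool(mask & (1 << (ord(ch) & (BLOOM_WIDTH - 1))))


def lstrip_count(s: str, chars: str) -> int:
    """Count leading characters of s that are contained in chars."""
    mask = make_bloom_mask(chars)
    i = 0
    while i < len(s):
        ch = s[i]
        if not bloom_hit(mask, ch):  # cheap pre-check, no search needed
            break
        if chars.find(ch) < 0:       # real search, run for every candidate
            break
        i += 1
    return i


print(lstrip_count("   abc", " "))
```

Note that for a string made up almost entirely of separators, as in the question, the pre-check passes every time, so the search runs once per character.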

For those interested, PyUnicode_FindChar, after quite a few checks, will eventually call find_char inside stringlib, which, when the length of the separators is < 10, will loop until it finds the character.

Apart from this, consider the additional functions that must be called just to get here.


As for lstrip and rstrip, the situation is similar. Flags for which mode of stripping to perform exist, namely: RIGHTSTRIP for rstrip, LEFTSTRIP for lstrip and BOTHSTRIP for strip. The logic inside do_strip and _PyUnicode_XStrip is performed conditionally based on the flag.
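A rough Python sketch of how one routine can serve all three methods through such a flag is shown below. The numeric flag values and the name strip_by_mode are assumptions made for illustration, and the None check only mimics the dispatch between the two paths; it does not reproduce their speed difference.

```python
# Flag values assumed for illustration only.
LEFTSTRIP, RIGHTSTRIP, BOTHSTRIP = 0, 1, 2


def strip_by_mode(s, striptype, chars=None):
    """Strip s on the left, right, or both sides depending on striptype."""
    if chars is None:
        should_strip = str.isspace               # no-argument case
    else:
        should_strip = lambda ch: ch in chars    # explicit separators

    i, j = 0, len(s)

    # The left side is processed unless only a right strip was requested.
    if striptype != RIGHTSTRIP:
        while i < j and should_strip(s[i]):
            i += 1

    # The right side is processed unless only a left strip was requested.
    if striptype != LEFTSTRIP:
        while j > i and should_strip(s[j - 1]):
            j -= 1

    return s[i:j]


print(repr(strip_by_mode("  a  ", LEFTSTRIP)))
print(repr(strip_by_mode("  a  ", RIGHTSTRIP)))
print(repr(strip_by_mode("  a  ", BOTHSTRIP)))
```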

Dimitris Fasarakis Hilliard answered Oct 14 '22 12:10
