Why is str.strip() so much faster than str.strip(' ')?

Tags: performance, python, string, python-3.x, python-internals

Stripping white-space can be done in two ways with str.strip. You can either issue a call with no arguments, str.strip(), which defaults to stripping white-space, or explicitly supply the argument yourself with str.strip(' ').
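As a side note, the two calls are only interchangeable when the padding consists purely of space characters. A small check, using a made-up sample string, shows the difference:

```python
# strip() with no argument removes every kind of whitespace at the ends;
# strip(' ') removes only the space character.
sample = " \t\n value \n\t "

stripped_default = sample.strip()
stripped_space = sample.strip(' ')

print(repr(stripped_default))
print(repr(stripped_space))
```

With the argument, stripping stops at the first tab because a tab is not in the set of characters supplied.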

But, why is it that when timed these functions perform so differently?

Using a sample string with an intentional amount of white spaces:

    s = " " * 100 + 'a' + " " * 100

The timings for s.strip() and s.strip(' ') are respectively:

    %timeit s.strip()
    The slowest run took 32.74 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000000 loops, best of 3: 396 ns per loop

    %timeit s.strip(' ')
    100000 loops, best of 3: 4.5 µs per loop

strip takes 396 ns while strip(' ') takes 4.5 µs. A similar scenario is present with rstrip and lstrip under the same conditions. Also, bytes objects seem to be affected too.
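For anyone who wants to reproduce the measurement outside IPython, here is a minimal sketch using the standard timeit module. The helper name best_time and the loop counts are arbitrary choices for illustration, and the absolute numbers will vary with the machine and the Python version:

```python
import timeit

s = " " * 100 + "a" + " " * 100


def best_time(stmt, number=20_000, repeat=5):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(stmt, globals={"s": s}, number=number, repeat=repeat)
    return min(timings) / number


t_default = best_time("s.strip()")
t_space = best_time("s.strip(' ')")

print(f"s.strip():    {t_default * 1e9:.0f} ns per call")
print(f"s.strip(' '): {t_space * 1e9:.0f} ns per call")
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes.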

The timings were performed on Python 3.5.2; on Python 2.7.1 the difference is less drastic. The docs on str.strip don't indicate anything useful, so why does this happen?

403

asked Jul 09 '16 19:07

Dimitris Fasarakis Hilliard

1 Answer

In a tl;dr fashion:

This is because two functions exist for the two different cases, as can be seen in unicode_strip: do_strip and _PyUnicode_XStrip, the first executing much faster than the second.

do_strip handles the common case str.strip(), where no arguments are passed, while do_argstrip (which wraps _PyUnicode_XStrip) handles the case where str.strip(arg) is called, i.e. arguments are provided.

do_argstrip just checks the separator: if it is None, it calls do_strip, and if it is a valid string, it calls _PyUnicode_XStrip.
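The effect of that dispatch can be observed from Python itself. The snippet below is only a behavioral check with made-up sample strings, not a view into the C code:

```python
text = "  padded  "

# Passing None explicitly behaves the same as passing nothing at all.
assert text.strip(None) == text.strip() == "padded"

# A str argument is treated as a set of characters to remove from both ends.
chars_result = "xxpaddedyx".strip("xy")

# An argument that is neither None nor a str is rejected.
error_message = None
try:
    text.strip(1)
except TypeError as exc:
    error_message = str(exc)

print(chars_result)
print(error_message)
```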

Both do_strip and _PyUnicode_XStrip follow the same logic: two counters are used, one equal to zero and the other equal to the length of the string.

Using two while loops, the first counter is incremented until a value not equal to the separator is reached and the second counter is decremented until the same condition is met.
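The two-counter logic can be sketched in pure Python as follows. This is an illustration of the idea rather than a translation of the C source; the function name strip_sketch and the is_strippable callback are invented for the example:

```python
def strip_sketch(text, is_strippable):
    """Remove from both ends the characters for which is_strippable is true."""
    i = 0
    j = len(text)
    # Advance the left counter past strippable characters.
    while i < j and is_strippable(text[i]):
        i += 1
    # Move the right counter back past strippable characters.
    while j > i and is_strippable(text[j - 1]):
        j -= 1
    return text[i:j]


print(repr(strip_sketch("  abc  ", str.isspace)))
print(repr(strip_sketch("xxabcxx", lambda ch: ch == "x")))
```

Only the is_strippable test differs between the two cases, which is where the rest of the answer focuses.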

The difference lies in how each function checks whether the current character is equal to the separator.

For `do_strip`:

In the most common case, where the characters in the string to be stripped can be represented in ASCII, an additional small performance boost is present.

    while (i < len) {
        Py_UCS1 ch = data[i];
        if (!_Py_ascii_whitespace[ch])
            break;
        i++;
    }

Accessing the current character in the data is done quickly by indexing the underlying array: Py_UCS1 ch = data[i];
The check whether a character is white-space is a simple index into an array called _Py_ascii_whitespace.

So, in short, it is quite efficient.
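To make the table-lookup idea concrete, here is a rough Python analogue. The table is built from str.isspace rather than copied from CPython, and the names ASCII_WHITESPACE and count_leading_whitespace are invented for this sketch:

```python
# 128-entry lookup table: 1 where the ASCII code point is whitespace, else 0.
ASCII_WHITESPACE = [int(chr(code).isspace()) for code in range(128)]


def count_leading_whitespace(data: bytes) -> int:
    """Return how many leading bytes of ASCII-only data are whitespace."""
    i = 0
    length = len(data)
    while i < length:
        ch = data[i]  # plain array read, analogous to data[i] in the C loop
        if not ASCII_WHITESPACE[ch]:  # one table lookup per character
            break
        i += 1
    return i


leading = count_leading_whitespace(b" \t\n abc ")
print(leading)
```

Each iteration costs one array read and one table lookup, with no function call involved.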

If the characters are not in the ASCII range, the differences aren't that drastic, but they do slow the overall execution down:

    while (i < len) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        if (!Py_UNICODE_ISSPACE(ch))
            break;
        i++;
    }

Accessing is done with Py_UCS4 ch = PyUnicode_READ(kind, data, i);
Checking whether the character is white-space is done by the Py_UNICODE_ISSPACE(ch) macro, which uses the same _Py_ascii_whitespace table for code points below 128 and otherwise calls _PyUnicode_IsWhitespace.
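A string padded with non-ASCII white-space is one input that should end up on this second path. The example below only demonstrates the visible behavior, with a sample string chosen for illustration:

```python
# EM SPACE (U+2003) and NO-BREAK SPACE (U+00A0) both count as whitespace for str.
padded = "\u2003\u00a0caf\u00e9\u2003"

unicode_stripped = padded.strip()
space_stripped = padded.strip(' ')

print(repr(unicode_stripped))
print(repr(space_stripped))
```

Here strip(' ') leaves the string untouched, since neither end is an ASCII space.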

For `_PyUnicode_XStrip`:

For this case, accessing the underlying data is, as in the previous case, done with PyUnicode_READ. The check to see if the character is white-space (or really, any character we've provided), on the other hand, is understandably a bit more complex.

    while (i < len) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        if (!BLOOM(sepmask, ch))
            break;
        if (PyUnicode_FindChar(sepobj, ch, 0, seplen, 1) < 0)
            break;
        i++;
    }

A cheap BLOOM(sepmask, ch) test first rules out characters that cannot be in the separator set. After that, PyUnicode_FindChar is used, which, although efficient, is much more complex and slower than an array access. For each character being stripped, it is called to see if that character is contained in the separator(s) we've provided. As the number of characters to strip increases, so does the overhead introduced by calling this function continuously.
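The two-stage check can be approximated in Python like this. The mask width of 64 bits, the way the mask is built, and all function names are assumptions made for the sketch; str.find stands in for PyUnicode_FindChar:

```python
BLOOM_WIDTH = 64  # assumed mask width, chosen for illustration


def make_bloom_mask(separators: str) -> int:
    """Set one bit per separator, based on the low bits of its code point."""
    mask = 0
    for ch in separators:
        mask |= 1 << (ord(ch) & (BLOOM_WIDTH - 1))
    return mask


def bloom_may_contain(mask: int, ch: str) -> bool:
    """False means definitely absent; True means an exact check is still needed."""
    return bool(mask & (1 << (ord(ch) & (BLOOM_WIDTH - 1))))


def count_leading_separators(text: str, separators: str) -> int:
    """Count leading characters of text that belong to separators."""
    mask = make_bloom_mask(separators)
    i = 0
    while i < len(text):
        ch = text[i]
        if not bloom_may_contain(mask, ch):  # cheap pre-filter
            break
        if separators.find(ch) < 0:  # exact membership test
            break
        i += 1
    return i


print(count_leading_separators("  xa", " "))
print(count_leading_separators("x8", "x"))
```

In the second call, '8' shares its low six bits with 'x', so it passes the mask test and is only rejected by the exact search. That is why the second check cannot be skipped.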

For those interested, PyUnicode_FindChar, after quite a few checks, will eventually call find_char inside stringlib, which, in the case where the length of the separators is less than 10, will loop until it finds the character.

Apart from this, consider the additional functions that must be called just to get here.

As for lstrip and rstrip, the situation is similar. Flags for which mode of stripping to perform exist, namely: RIGHTSTRIP for rstrip, LEFTSTRIP for lstrip and BOTHSTRIP for strip. The logic inside do_strip and _PyUnicode_XStrip is performed conditionally based on the flag.
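A Python sketch of the flag-based variant might look like the following. The numeric values given to the flags and the function name strip_with_flag are illustrative assumptions; only the flag names come from the C source:

```python
LEFTSTRIP, RIGHTSTRIP, BOTHSTRIP = 0, 1, 2  # values assumed for this sketch


def strip_with_flag(text, striptype, chars=None):
    """Strip whitespace (chars is None) or the given chars, per striptype."""
    if chars is None:
        is_strippable = str.isspace
    else:
        is_strippable = lambda ch: ch in chars

    i, j = 0, len(text)
    if striptype != RIGHTSTRIP:  # left side is handled for LEFTSTRIP and BOTHSTRIP
        while i < j and is_strippable(text[i]):
            i += 1
    if striptype != LEFTSTRIP:  # right side is handled for RIGHTSTRIP and BOTHSTRIP
        while j > i and is_strippable(text[j - 1]):
            j -= 1
    return text[i:j]


print(repr(strip_with_flag("  abc  ", LEFTSTRIP)))
print(repr(strip_with_flag("  abc  ", RIGHTSTRIP)))
print(repr(strip_with_flag("--abc--", BOTHSTRIP, "-")))
```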

145

answered Oct 14 '22 12:10

Dimitris Fasarakis Hilliard
