
Performance of str.strip for Pandas

I thought the third option was supposed to be the fastest way to strip whitespace? Can someone give me some general rules to apply when working with large data sets? I normally use .astype(str), but that is clearly not worthwhile for columns that I already know are objects.

%%timeit
fcr['id'] = fcr['id'].astype(str).map(str.strip)
10 loops, best of 3: 47.8 ms per loop

%%timeit
fcr['id'] = fcr['id'].map(str.strip)
10 loops, best of 3: 25.2 ms per loop

%%timeit
fcr['id'] = fcr['id'].str.strip(' ')
10 loops, best of 3: 55.5 ms per loop
LeviJr Canlas asked Jan 18 '16 19:01




1 Answer

Let's first look at the difference between .map(str.strip) and .str.strip() (the second and third cases).
To see why they differ, you need to understand what .str.strip() does under the hood: it essentially performs a map(str.strip), but through a custom map function that handles missing values.
Because .str.strip() does more work than .map(str.strip), it is to be expected that it will be slower (and, as you have shown, about 2x slower in your case).
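To make that extra work concrete, here is a minimal sketch of a strip that checks each value before touching it. This is not pandas' actual implementation; the helper name strip_skipping_non_strings is made up for illustration, and it only mimics the observable behaviour (non-strings come back as NaN).

```python
import numpy as np
import pandas as pd


def strip_skipping_non_strings(values):
    """Hypothetical helper: strip strings, return NaN for anything else."""
    # The isinstance check on every element is the kind of extra work
    # that a plain .map(str.strip) does not do.
    return values.map(lambda v: v.strip() if isinstance(v, str) else np.nan)


# dtype=object keeps the column as plain Python objects, as in the question.
s = pd.Series(["  a ", np.nan, " b"], dtype=object)
result = strip_skipping_non_strings(s)
print(result)
```

The per-element check is cheap, but on a large column it runs once per row, so some overhead compared to the bare map is to be expected.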

Using the .str.strip() method has its advantages: it automatically handles NaN (and other non-string values). Suppose the 'id' column contains a NaN value:

In [4]: df['id'].map(str.strip)
...
TypeError: descriptor 'strip' requires a 'str' object but received a 'float'

In [5]: df['id'].str.strip()
Out[5]:
0                   NaN
1                as asd
2        asdsa asdasdas
              ...
29997              asds
29998            as asd
29999    asdsa asdasdas
Name: id, dtype: object

As @EdChum points out, if this performance difference matters and you are sure you don't have any NaN values, you can indeed use map(str.strip).
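If you are not completely sure, one option is to guard the fast path with an explicit check. The sketch below is only an illustration: fast_strip is a hypothetical helper, and the check itself costs a pass over the data, so it is worth timing on your own columns before relying on it.

```python
import pandas as pd


def fast_strip(series):
    """Hypothetical helper: use the plain map only when every value is a str."""
    if all(isinstance(v, str) for v in series):
        # Safe: str.strip will never receive a float NaN or a number.
        return series.map(str.strip)
    # Fall back to the accessor, which returns NaN for non-string values.
    return series.str.strip()


clean = pd.Series([" a ", "b "], dtype=object)
with_missing = pd.Series([" a ", None], dtype=object)

print(fast_strip(clean).tolist())
print(fast_strip(with_missing).tolist())
```

If the column is known to be clean by construction (for example, it was validated on load), the check can be dropped and .map(str.strip) called directly.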


Now for the other difference, fcr['id'].astype(str).map(str.strip). If you already know that the values inside the series are strings, the astype(str) call is of course superfluous, and it is this call that explains the difference:

In [74]: %timeit df['id'].astype(str).map(str.strip)
100 loops, best of 3: 10.5 ms per loop

In [75]: %timeit df['id'].astype(str)
100 loops, best of 3: 5.25 ms per loop

In [76]: %timeit df['id'].map(str.strip)
100 loops, best of 3: 5.18 ms per loop
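To reproduce this kind of comparison outside IPython, a sketch along the following lines could be used. The data is synthetic (30,000 padded strings with no missing values, an assumption standing in for the real 'id' column), and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic stand-in for the 'id' column: 30,000 padded strings, no NaN.
df = pd.DataFrame(
    {"id": ["  as asd ", " asdsa asdasdas", "asds  "] * 10_000}, dtype=object
)

candidates = {
    "astype(str).map(str.strip)": lambda: df["id"].astype(str).map(str.strip),
    "astype(str)": lambda: df["id"].astype(str),
    "map(str.strip)": lambda: df["id"].map(str.strip),
    "str.strip()": lambda: df["id"].str.strip(),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats of 10 calls each, converted to milliseconds per call.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    timings[label] = best * 1000
    print(f"{label:<28} {timings[label]:.2f} ms per call")
```

Timing astype(str) on its own, as in the session above, is what separates the cost of the conversion from the cost of the strip.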

Note that if you have non-string values (NaN, numeric values, ...), .str.strip() and .astype(str).map(str.strip) will not yield the same result:

In [11]: s = pd.Series(['  a', 10])

In [12]: s.astype(str).map(str.strip)
Out[12]:
0     a
1    10
dtype: object

In [13]: s.str.strip()
Out[13]:
0      a
1    NaN
dtype: object

As you can see, .str.strip() returns NaN for non-string values instead of converting them to strings.

joris answered Oct 03 '22 11:10
