I thought the third option was supposed to be the fastest way to strip whitespace? Can someone give me some general rules that I should be applying when working with large data sets? I normally use .astype(str), but clearly that is not worthwhile for columns which I know are objects already.
%%timeit
fcr['id'] = fcr['id'].astype(str).map(str.strip)
10 loops, best of 3: 47.8 ms per loop
%%timeit
fcr['id'] = fcr['id'].map(str.strip)
10 loops, best of 3: 25.2 ms per loop
%%timeit
fcr['id'] = fcr['id'].str.strip(' ')
10 loops, best of 3: 55.5 ms per loop
str.lstrip() removes whitespace from the left side of a string, str.rstrip() removes it from the right side, and str.strip() removes it from both sides.
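A minimal sketch of the three methods on a plain Python string. The sample values are made up for illustration. It also shows that passing ' ' as an argument, as in the third timing above, strips only space characters and leaves tabs and newlines in place.

```python
# Sample value made up for illustration.
s = "  as asd  "

left = s.lstrip()    # leading whitespace removed
right = s.rstrip()   # trailing whitespace removed
both = s.strip()     # whitespace removed from both ends

# With an explicit ' ' argument only space characters are stripped,
# so the tab and the newline survive.
padded = "\tas asd \n"
only_spaces = padded.strip(" ")
all_whitespace = padded.strip()
```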
Polars supports eager evaluation and lazy evaluation whereas Pandas only supports eager evaluation.
The .str accessor can be used to access the values of a Series as strings and apply several methods to them. The pandas Series.str.contains() function is used to test whether a pattern or regex is contained within a string of a Series or Index.
Let's first look at the difference between .map(str.strip) and .str.strip() (the second and third cases).
To do that, you need to understand what .str.strip() does under the hood: it essentially does a map(str.strip), but using a custom map function that handles missing values.
So given that .str.strip() does more than .map(str.strip), it is to be expected that this method will always be slower (and as you have shown, in your case 2x slower).
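To make the extra work concrete, here is a rough sketch of a NaN-aware map. This is not pandas's actual implementation; strip_or_nan is a hypothetical helper and the sample data is made up. It only illustrates the per-value type check that a plain map(str.strip) skips.

```python
import numpy as np
import pandas as pd

def strip_or_nan(value):
    # Hypothetical helper: strip real strings, return NaN for anything else.
    # The isinstance check is the extra per-value work.
    return value.strip() if isinstance(value, str) else np.nan

# Made-up sample data with one missing value.
s = pd.Series(["  as asd ", np.nan, " asdsa asdasdas"], dtype=object)
result = s.map(strip_or_nan)
print(result)
```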
Using the .str.strip() method has its advantages in the automatic NaN handling (or handling of other non-string values). Suppose the 'id' column contains a NaN value:
In [4]: df['id'].map(str.strip)
...
TypeError: descriptor 'strip' requires a 'str' object but received a 'float'
In [5]: df['id'].str.strip()
Out[5]:
0 NaN
1 as asd
2 asdsa asdasdas
...
29997 asds
29998 as asd
29999 asdsa asdasdas
Name: id, dtype: object
As @EdChum points out, you can indeed use map(str.strip) if you are sure you don't have any NaN values and this performance difference is important.
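A small sketch of that idea, assuming made-up sample data: check for missing values first, then use the plain map. As an addition that is not part of the original discussion, Series.map also accepts na_action='ignore', which passes missing values through without calling the function. Note that this only covers missing values, not other non-string values such as integers.

```python
import numpy as np
import pandas as pd

# Made-up data without missing values: the plain map is safe here.
clean = pd.Series([" as asd", "asds "], dtype=object)
if clean.notna().all():
    stripped = clean.map(str.strip)

# Made-up data with a NaN: na_action="ignore" leaves the NaN untouched
# instead of raising a TypeError.
with_nan = pd.Series([" as asd", np.nan], dtype=object)
stripped_skip = with_nan.map(str.strip, na_action="ignore")
```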
Coming back to the other difference, that of fcr['id'].astype(str).map(str.strip): if you already know that the values inside the series are strings, doing the astype(str) call is of course superfluous. And it is this call that explains the difference:
In [74]: %timeit df['id'].astype(str).map(str.strip)
100 loops, best of 3: 10.5 ms per loop
In [75]: %timeit df['id'].astype(str)
100 loops, best of 3: 5.25 ms per loop
In [76]: %timeit df['id'].map(str.strip)
100 loops, best of 3: 5.18 ms per loop
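If you want to repeat this kind of measurement outside IPython, the standard timeit module can be used. The sketch below assumes a made-up frame of 30,000 strings in an 'id' column; the absolute numbers will differ from the ones above depending on machine, data, and pandas version.

```python
import timeit
import pandas as pd

# Made-up test data: 30,000 string values with stray whitespace.
df = pd.DataFrame({"id": [" as asd ", "asdsa asdasdas ", " asds"] * 10_000}, dtype=object)

candidates = {
    "astype(str).map(str.strip)": lambda: df["id"].astype(str).map(str.strip),
    "map(str.strip)": lambda: df["id"].map(str.strip),
    "str.strip()": lambda: df["id"].str.strip(),
}

number = 5  # calls per timing run
timings = {
    # Best of 3 runs, averaged over the calls in that run.
    name: min(timeit.repeat(fn, number=number, repeat=3)) / number
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per loop")
```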
Note that if you have non-string values (NaN, numeric values, ...), using .str.strip() and .astype(str).map(str.strip) will not yield the same result:
In [11]: s = pd.Series([' a', 10])
In [12]: s.astype(str).map(str.strip)
Out[12]:
0 a
1 10
dtype: object
In [13]: s.str.strip()
Out[13]:
0 a
1 NaN
dtype: object
As you can see, .str.strip() will return non-string values as NaN, instead of converting them to strings.
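One way to act on this is to choose the method based on what the column actually contains. The sketch below uses a hypothetical helper, strip_series, which is not a pandas function. Keep in mind that the type check is itself a pass over the data, so it costs part of what the faster path saves; it is mainly useful when you are unsure about a column's contents.

```python
import pandas as pd

def strip_series(s):
    # Hypothetical helper, not part of pandas.
    # If every value is a str, the plain map is safe.
    if s.map(lambda v: isinstance(v, str)).all():
        return s.map(str.strip)
    # Otherwise fall back to .str.strip(), which turns non-strings into NaN.
    return s.str.strip()

all_strings = strip_series(pd.Series([" a", "b "], dtype=object))
mixed = strip_series(pd.Series([" a", 10], dtype=object))
```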