Performance of str.strip for Pandas

Tags: python-3.x, pandas

I thought the third option was supposed to be the fastest way to strip whitespace? Can someone give me some general rules that I should apply when working with large data sets? I normally use .astype(str), but clearly that is not worthwhile for columns which I already know are of object dtype.

%%timeit
fcr['id'] = fcr['id'].astype(str).map(str.strip)
10 loops, best of 3: 47.8 ms per loop

%%timeit
fcr['id'] = fcr['id'].map(str.strip)
10 loops, best of 3: 25.2 ms per loop

%%timeit
fcr['id'] = fcr['id'].str.strip(' ')
10 loops, best of 3: 55.5 ms per loop

asked Jan 18 '16 19:01

LeviJr Canlas

1 Answer

Let's first look at the difference between .map(str.strip) and .str.strip() (the second and third cases).
To do that, you need to understand what .str.strip() does under the hood: it essentially performs a map(str.strip), but with a custom map function that handles missing values.
Because .str.strip() does more work than .map(str.strip), it is to be expected that it will be slower (and, as you have shown, about 2x slower in your case).
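
To make the extra work concrete, here is a minimal sketch of a NaN-aware strip. It is not the actual pandas implementation; the helper name nan_aware_strip and the sample data are assumptions made for illustration. It only shows the kind of per-element check that a plain map(str.strip) skips.

```python
import numpy as np
import pandas as pd


def nan_aware_strip(values):
    """Hypothetical helper: strip strings, map everything else to NaN.

    This only illustrates the idea of a map that checks each element;
    it is not how pandas implements .str.strip() internally.
    """
    return pd.Series(
        [v.strip() if isinstance(v, str) else np.nan for v in values],
        index=values.index,
        dtype=object,
    )


# Sample data (assumed): two strings with padding and one missing value.
s = pd.Series(["  a ", np.nan, "b  "], dtype=object)
result = nan_aware_strip(s)
print(result.tolist())
```

The isinstance check on every element is the price paid for not raising on missing values.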

Using the .str.strip() method has its advantages: it automatically handles NaN (and other non-string values). Suppose the 'id' column contains a NaN value:

In [4]: df['id'].map(str.strip)
...
TypeError: descriptor 'strip' requires a 'str' object but received a 'float'

In [5]: df['id'].str.strip()
Out[5]:
0                   NaN
1                as asd
2        asdsa asdasdas
              ...
29997              asds
29998            as asd
29999    asdsa asdasdas
Name: id, dtype: object

As @EdChum points out, if this performance difference matters and you are sure the column contains no NaN values, you can indeed use map(str.strip).
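
One way to act on this is sketched below, under the assumption that missing values are the only non-string entries you need to guard against. The sample series is made up. The first option checks for NaN up front and falls back to .str.strip(); the second uses the na_action="ignore" argument of Series.map, which passes NaN through without calling the function on it.

```python
import numpy as np
import pandas as pd

# Sample data (assumed): strings plus one missing value.
s = pd.Series(["  a ", np.nan, "b  "], dtype=object)

# Option 1: use the fast path only when there are no missing values.
if s.notna().all():
    stripped = s.map(str.strip)
else:
    stripped = s.str.strip()

# Option 2: let map skip missing values itself.
stripped_ignore = s.map(str.strip, na_action="ignore")

print(stripped.tolist())
print(stripped_ignore.tolist())
```

Note that neither option protects map(str.strip) against other non-string values such as integers; those would still raise a TypeError.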

Coming back to the other difference, the one involving fcr['id'].astype(str).map(str.strip): if you already know that the values inside the series are strings, the astype(str) call is of course superfluous. It is this call that explains the difference, as it takes roughly half of the total time:

In [74]: %timeit df['id'].astype(str).map(str.strip)
100 loops, best of 3: 10.5 ms per loop

In [75]: %timeit df['id'].astype(str)
100 loops, best of 3: 5.25 ms per loop

In [76]: %timeit df['id'].map(str.strip)
100 loops, best of 3: 5.18 ms per loop
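
If you want to reproduce this comparison outside IPython, the following sketch uses the standard timeit module. The data is synthetic (30,000 made-up strings, assumed for illustration), so the absolute numbers will differ from the ones above and will depend on your machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic data (assumed): 30,000 padded strings, no missing values.
s = pd.Series(["  as asd ", "asdsa asdasdas  ", " asds"] * 10_000, dtype=object)

candidates = {
    "astype(str).map(str.strip)": lambda: s.astype(str).map(str.strip),
    "astype(str)": lambda: s.astype(str),
    "map(str.strip)": lambda: s.map(str.strip),
    "str.strip()": lambda: s.str.strip(),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats of 10 calls each, converted to ms per call.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    timings[label] = best * 1000
    print(f"{label:<28} {timings[label]:.2f} ms per call")
```

Timing astype(str) on its own, as the answer does, is what isolates how much of the first variant is spent on the conversion rather than on the stripping.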

Note that if you have non-string values (NaN, numeric values, ...), .str.strip() and .astype(str).map(str.strip) will not yield the same result:

In [11]: s = pd.Series(['  a', 10])

In [12]: s.astype(str).map(str.strip)
Out[12]:
0     a
1    10
dtype: object

In [13]: s.str.strip()
Out[13]:
0      a
1    NaN
dtype: object

As you can see, .str.strip() will return non-string values as NaN, instead of converting them to strings.
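
If neither behaviour is what you want, a third possibility is to strip only the string values and leave everything else untouched. The sketch below is one way to do that; the helper name strip_if_str is hypothetical, and this variant pays for a Python-level isinstance check on every element.

```python
import pandas as pd

s = pd.Series(["  a", 10], dtype=object)


def strip_if_str(value):
    """Hypothetical helper: strip strings, return other values unchanged."""
    return value.strip() if isinstance(value, str) else value


kept = s.map(strip_if_str)
print(kept.tolist())  # ['a', 10]
```

Here the integer 10 stays an integer, rather than becoming the string '10' or NaN.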

answered Oct 03 '22 11:10

joris
