Applying string functions to elements that can be NaN

Tags: python, string, pandas

I have a Pandas DataFrame with categorical data written by humans. Let's say this:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ["A", " A", "A ", "b", "B"]})
>>> df
  name
0    A
1    A
2   A
3    b
4    B

I want to normalize these values by stripping spaces and uppercasing them. This works great:

>>> df.apply(lambda x: x['name'].upper().strip(), axis=1)
0    A
1    A
2    A
3    B
4    B

The issue I'm having is that I also have a few nan values, and I effectively want those to remain as nans after this transformation. But if I have this:

>>> df2 = pd.DataFrame({'name': ["A", " A", "A ", "b", "B", np.nan]})
>>> df2.apply(lambda x: x['name'].upper().strip(), axis=1)
("'float' object has no attribute 'upper'", u'occurred at index 5')

What I'd like is this:

0      A
1      A
2      A
3      B
4      B
5    NaN

I understand why this is happening (nan is a float, while the others are strings), but I can't find an elegant way of writing this.

Any thoughts?

asked Nov 02 '15 23:11

user1496984

1 Answer

You can use the vectorized str methods, which pass NaN through unchanged instead of raising:

>>> df2.name.str.strip().str.upper()
0      A
1      A
2      A
3      B
4      B
5    NaN
Name: name, dtype: object
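
For reference, here is a self-contained sketch of the same idea, assuming pandas and NumPy are imported under their usual aliases. The variable names normalized and mapped are only illustrative, and the Series.map variant with na_action="ignore" is an alternative that this answer does not mention; it may be handy when the transformation is an arbitrary Python function rather than a built-in string method.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question, including one missing value.
df2 = pd.DataFrame({"name": ["A", " A", "A ", "b", "B", np.nan]})

# The .str accessor applies each string method element-wise and
# leaves missing values as missing instead of raising AttributeError.
normalized = df2["name"].str.strip().str.upper()

# Alternative (an assumption about what you might need, not part of the
# original answer): na_action="ignore" makes Series.map skip NaN entries,
# so the lambda is only ever called on real strings.
mapped = df2["name"].map(lambda s: s.strip().upper(), na_action="ignore")

# Write the cleaned values back to the column.
df2["name"] = normalized
print(df2)
```

Both approaches work on a single column, which avoids the row-wise df.apply(..., axis=1) from the question.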

answered Sep 28 '22 01:09

Alexander
