I have a Pandas DataFrame with categorical data written by humans. Let's say this:
>>> df = pd.DataFrame({'name': ["A", " A", "A ", "b", "B"]})
name
0 A
1 A
2 A
3 b
4 B
I want to normalize these values by stripping spaces and uppercasing them. This works great:
>>> df.apply(lambda x: x['name'].upper().strip(), axis=1)
0 A
1 A
2 A
3 B
4 B
The issue I'm having is that I also have a few nan values, and I effectively want those to remain as nans after this transformation. But if I have this:
>>> df2 = pd.DataFrame({'name': ["A", " A", "A ", "b", "B", np.nan]})
>>> df2.apply(lambda x: x['name'].upper().strip(), axis=1)
("'float' object has no attribute 'upper'", u'occurred at index 5')
What I'd like is this:
0 A
1 A
2 A
3 B
4 B
5 NaN
I understand why this is happening (nan is a float, while the others are strings), but I can't find an elegant way of writing this.
Any thoughts?
We can replace NaN with an empty string using the df.replace() function, which substitutes an empty string in place of each NaN value.
We can check whether a value is NaN by using the property that NaN != NaN. Let us define a boolean function isNaN() that returns True if the given argument is NaN and False otherwise. We can also convert a value to float and then check whether the result is NaN.
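A minimal sketch of the two checks described above. The function names `is_nan` and `is_nan_via_float` are illustrative, not part of pandas or the standard library. Note that the float-conversion variant also treats the literal string "nan" as NaN, because `float("nan")` is a valid conversion.

```python
import math


def is_nan(value):
    # NaN is the only float value that is not equal to itself,
    # so this is True for NaN and False for ordinary strings and numbers.
    return value != value


def is_nan_via_float(value):
    # Convert to float first, then test. Values that cannot be
    # converted (such as "A" or None) are treated as not NaN.
    try:
        return math.isnan(float(value))
    except (TypeError, ValueError):
        return False
```

For example, `is_nan(float("nan"))` is true, while `is_nan("A")` is false.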
By using the replace() or fillna() methods you can replace NaN values with a blank/empty string in a Pandas DataFrame. NaN stands for Not a Number and is one of the common ways to represent a missing value in a Pandas DataFrame.
We can also replace NaN values with 0 to get rid of them. This is done with the fillna() function, which finds the NaN values in the DataFrame's columns and replaces them with the given value.
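As a rough illustration of the fillna() approach, the sketch below fills the missing entry in the question's `name` column with an empty string, and fills a separate numeric Series with 0. The `scores` Series is a hypothetical example, not data from the question. Keep in mind that filling discards the NaN marker, so this does not by itself keep the values as NaN.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'name': ["A", " A", "A ", "b", "B", np.nan]})

# Replace the missing name with an empty string (returns a new Series).
filled_blank = df2['name'].fillna("")

# Hypothetical numeric data: replace missing values with 0.
scores = pd.Series([1.0, np.nan, 3.0])
scores_filled = scores.fillna(0)
```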
You can use the vectorized str operators:
>>> df2.name.str.strip().str.upper()
0 A
1 A
2 A
3 B
4 B
5 NaN
Name: name, dtype: object
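To keep the normalized values, the result can be assigned back to the column. If you prefer a Python function over the str accessor, one possible alternative is Series.map with na_action='ignore', which skips missing values instead of passing them to the function. The `sample` Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'name': ["A", " A", "A ", "b", "B", np.nan]})

# The .str methods propagate NaN, so the missing entry stays missing.
df2['name'] = df2['name'].str.strip().str.upper()

# Alternative sketch: map a plain function and skip NaN entries.
sample = pd.Series(["A", " a ", np.nan])
mapped = sample.map(lambda s: s.strip().upper(), na_action='ignore')
```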