I want to find all values in a Pandas dataframe that contain whitespace (any arbitrary amount) and replace those values with NaNs.
Any ideas how this can be improved?
Basically I want to turn this:
                   A    B  C
2000-01-01 -0.532681  foo  0
2000-01-02  1.490752  bar  1
2000-01-03 -1.387326  foo  2
2000-01-04  0.814772  baz
2000-01-05 -0.222552       4
2000-01-06 -1.176781  qux
Into this:
                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz  NaN
2000-01-05 -0.222552  NaN    4
2000-01-06 -1.176781  qux  NaN
I've managed to do it with the code below, but man is it ugly. It's not Pythonic and I'm sure it's not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.
for i in df.columns:
    df[i][df[i].apply(lambda i: True if re.search(r'^\s*$', str(i)) else False)] = None
It could be optimized a bit by only iterating through fields that could contain empty strings:
if df[i].dtype == np.dtype('object')
But that's not much of an improvement.
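In context, that dtype check might look something like this (just a rough sketch of the same loop, not claiming it's any prettier):

import re
import numpy as np

for i in df.columns:
    # only object columns can hold strings, so skip the numeric ones
    if df[i].dtype == np.dtype('object'):
        mask = df[i].apply(lambda v: bool(re.search(r'^\s*$', str(v))))
        df.loc[mask, i] = None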
And finally, this code sets the target strings to None, which works with pandas functions like fillna(), but it would be nice for completeness if I could actually insert a NaN directly instead of None.
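For what it's worth, something like DataFrame.mask does seem to let me drop NaN in directly instead of None, though I'm not sure it's any more idiomatic:

import numpy as np

# build a boolean frame marking whitespace-only strings, then mask them with NaN
# (applymap is element-wise; newer pandas also exposes this as DataFrame.map)
is_blank = df.applymap(lambda v: isinstance(v, str) and v.strip() == '')
df = df.mask(is_blank, np.nan)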
To replace blank values (whitespace) with NaN in pandas, we can call replace on the data frame: after creating the df data frame, we replace all whitespace-only values with NaN by calling replace with a regex that matches whitespace, np.nan as the replacement value, and regex set to True.
str.lstrip() removes spaces from the left side of a string, str.rstrip() removes them from the right side, and str.strip() removes them from both sides. pandas provides Series methods with the same names as Python's built-in string functions through the .str accessor, so they can be applied to a whole column at once.
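As a rough sketch of that idea (assuming the df from the question), you could strip each string and treat anything that comes out empty as missing:

import numpy as np

# .str.strip() works element-wise on object columns; non-string values become NaN
# after stripping, so they never compare equal to '' and are left alone
for col in df.select_dtypes(include='object').columns:
    stripped = df[col].str.strip()
    df.loc[stripped == '', col] = np.nan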
I think df.replace() does the job, since pandas 0.13:
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, ' ', 4],
    [-1.176781, 'qux', ' '],
], columns='A B C'.split(), index=pd.date_range('2000-01-01', '2000-01-06'))

# replace field that's entirely space (or empty) with NaN
print(df.replace(r'^\s*$', np.nan, regex=True))
Produces:
                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz  NaN
2000-01-05 -0.222552  NaN    4
2000-01-06 -1.176781  qux  NaN
As Temak pointed out, use df.replace(r'^\s+$', np.nan, regex=True) if the empty string is valid data you want to keep: \s* also matches a completely empty field, while \s+ requires at least one whitespace character.
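As a quick illustration of the difference, with a small series containing both a genuinely empty string and a whitespace-only string:

s = pd.Series(['foo', '', '   '])
print(s.replace(r'^\s*$', np.nan, regex=True))  # both '' and '   ' become NaN
print(s.replace(r'^\s+$', np.nan, regex=True))  # only '   ' becomes NaN; '' is kept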