How can I replace the values from certain columns in a pandas.DataFrame that occur rarely, i.e. with low frequency (while ignoring NaNs)?
For example, in the following dataframe, suppose I want to replace any value in column A or B that occurs fewer than three times in its respective column. I want to replace these with "other":
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog',np.nan, 'emu', 'emu']})
df
A | B | C |
----------------------
ant | cat | dog |
ant | peach | dog |
cherry | cat | NaN |
NaN | cat | emu |
ant | peach | emu |
In other words, in columns A and B, I want to replace those values that occur twice or less (but leave NaNs alone).
So the output I want is:
A | B | C |
----------------------
ant | cat | dog |
ant | other | dog |
other | cat | NaN |
NaN | cat | emu |
ant | other | emu |
This is related to a previously posted question, "Remove low frequency values from pandas.dataframe", but the solution there resulted in "AttributeError: 'NoneType' object has no attribute 'any'" (I think because I have NaN values?).
This is pretty similar to Change values in pandas dataframe according to value_counts(). You can add a condition to the lambda function to exclude column 'C' as follows:
df.apply(lambda x: x.mask(x.map(x.value_counts())<3, 'other') if x.name!='C' else x)
Out:
A B C
0 ant cat dog
1 ant other dog
2 other cat NaN
3 NaN cat emu
4 ant other emu
This iterates over the columns. For each column, it computes the value counts and uses that Series to map each value to its frequency, which lets x.mask check whether the count is smaller than 3. Where it is, mask returns 'other'; otherwise it keeps the original value. NaNs map to NaN, which never compares as smaller than 3, so they are left alone. Lastly, a condition on the column name skips column 'C'.
The lambda's condition can be generalized to multiple columns by changing x.name!='C' to x.name not in ['C', 'D', 'E', 'F'], or to x.name not in 'CDEF' (a substring test, so only suitable when every column name is a single character).
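To make the steps in the one-liner easier to follow, here is a minimal sketch that splits it into named stages and applies it to an explicit list of columns instead of excluding column 'C' by name. The helper name mask_rare and its parameters min_count and replacement are made up for illustration; they are not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['ant', 'ant', 'cherry', np.nan, 'ant'],
    'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
    'C': ['dog', 'dog', np.nan, 'emu', 'emu'],
})

def mask_rare(col, min_count=3, replacement='other'):
    # Hypothetical helper: same logic as the lambda, one step at a time.
    counts = col.value_counts()   # frequency of each value; NaN is excluded by default
    per_row = col.map(counts)     # frequency of the value in each row; NaN rows map to NaN
    # NaN < min_count is False, so missing values are never replaced.
    return col.mask(per_row < min_count, replacement)

cols = ['A', 'B']                 # columns to clean; 'C' is left untouched
result = df.copy()
result[cols] = df[cols].apply(mask_rare)
print(result)
```

Selecting the columns up front avoids the name check inside the lambda, which may read more clearly when the list of columns is long.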
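Using a helper function and replace:
def replace_low_freq(df, threshold=2, replacement='other'):
    s = df.stack()        # long Series of all values
    c = s.value_counts()  # counts across the stacked values
    # map every value occurring `threshold` times or fewer to the replacement
    m = pd.Series(replacement, c.index[c <= threshold])
    return s.replace(m).unstack()
cols = list('AB')
# drop(columns=...) replaces the positional axis argument, which pandas 2.0 no longer accepts
replace_low_freq(df[cols]).join(df.drop(columns=cols))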
A B C
0 ant cat dog
1 ant other dog
2 other cat NaN
3 None cat emu
4 ant other emu
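One thing to keep in mind with the stacked approach: value_counts runs on the stacked Series, so frequencies are pooled across all selected columns instead of being computed per column. For the example data the two agree, but they can differ when the same value appears in more than one column. The sketch below is a per-column variant under that assumption; the function name replace_low_freq_per_column and the demo frame are hypothetical and only serve as illustration.

```python
import numpy as np
import pandas as pd

def replace_low_freq_per_column(df, threshold=2, replacement='other'):
    # Hypothetical variant: count frequencies separately for each column.
    out = df.copy()
    for name in out.columns:
        counts = out[name].value_counts()            # NaN is excluded by default
        rare = counts.index[counts <= threshold]     # values occurring `threshold` times or fewer
        # Keep values that are not rare (NaN included), replace the rest.
        out[name] = out[name].where(~out[name].isin(rare), replacement)
    return out

# 'cat' occurs once in A and twice in B: rare in each column, but 3 times when pooled.
demo = pd.DataFrame({'A': ['cat', 'ant', 'ant', 'ant'],
                     'B': ['cat', 'cat', np.nan, 'emu']})
fixed = replace_low_freq_per_column(demo)
print(fixed)
```

Here 'cat' is replaced in both columns because it is counted within each column, whereas a pooled count of three would exceed the threshold and leave it in place.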