Replace low frequency categorical values from pandas.dataframe while ignoring NaNs

Question

How can I replace the values in certain columns of a pandas.DataFrame that occur rarely, i.e. with low frequency (while ignoring NaNs)?

For example, in the following dataframe, suppose I wanted to replace any values in columns A or B that occur fewer than three times in their respective column. I want to replace these with "other":

import pandas as pd
import numpy as np

# np.nan is used directly; the pd.np alias is not available in current pandas
df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu']})
df
   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | peach | dog |
cherry | cat   | NaN |
NaN    | cat   | emu |
ant    | peach | emu |

In other words, in columns A and B, I want to replace those values that occur two or fewer times (but leave NaNs alone).

So the output I want is:

   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | other | dog |
other  | cat   | NaN |
NaN    | cat   | emu |
ant    | other | emu |

This is related to a previously posted question, "Remove low frequency values from pandas.dataframe", but the solution there resulted in "AttributeError: 'NoneType' object has no attribute 'any'" (I think because I have NaN values?).

ayhan · Accepted Answer

This is pretty similar to Change values in pandas dataframe according to value_counts(). You can add a condition to the lambda function to exclude column 'C' as follows:

df.apply(lambda x: x.mask(x.map(x.value_counts())<3, 'other') if x.name!='C' else x)
Out: 
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu

This applies the lambda to each column in turn. For each column, x.value_counts() gives the count of every distinct value, and x.map(...) looks up that count for each row. x.mask then replaces a value with 'other' where its count is smaller than 3 and keeps the original value otherwise. NaNs are left alone because value_counts() does not count them by default, so they map to NaN, and the comparison NaN < 3 is False. Lastly, the condition on x.name returns column 'C' unchanged.
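To make the intermediate steps concrete, here is a small sketch that runs the same three operations by hand on column A of the example. The variable names (counts, per_row, result) are only for illustration and do not appear in the answer:

```python
import numpy as np
import pandas as pd

# Column A from the question, including its NaN
a = pd.Series(['ant', 'ant', 'cherry', np.nan, 'ant'], name='A')

# Step 1: count each distinct value (NaN is skipped by default)
counts = a.value_counts()

# Step 2: look up the count for every row; NaN has no entry, so it maps to NaN
per_row = a.map(counts)

# Step 3: replace values whose count is below 3; NaN < 3 is False, so NaN is kept
result = a.mask(per_row < 3, 'other')
print(result)
```

Only 'cherry' is replaced here; the missing value in row 3 stays missing.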

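The lambda's condition can be generalized to multiple columns by changing x.name!='C' to x.name not in ['C', 'D', 'E', 'F']. The shorter x.name not in 'CDEF' also works for single-character column names, but it is a substring test, so a column named 'CD' would be excluded as well.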
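If it is easier to list the columns to change rather than the ones to skip, the same idea can be wrapped in a small function. This is only a sketch: replace_rare and its parameters are hypothetical names, not part of pandas or of the answer above.

```python
import numpy as np
import pandas as pd

def replace_rare(df, cols, threshold=3, replacement='other'):
    """Return a copy of df where, in each column of cols, values that occur
    fewer than `threshold` times in that column become `replacement`.
    NaNs are left untouched. (Hypothetical helper for illustration.)"""
    out = df.copy()
    for col in cols:
        # Per-row frequency of the value in this column; NaN rows map to NaN
        freq = out[col].map(out[col].value_counts())
        out[col] = out[col].mask(freq < threshold, replacement)
    return out

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu']})

result = replace_rare(df, ['A', 'B'])
print(result)
```

Column C is never touched because it is not in the list, so no name check is needed.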

piRSquared · Answer

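Using a helper function and replace:

def replace_low_freq(df, threshold=2, replacement='other'):
    s = df.stack()        # long format: one entry per (row, column) cell
    c = s.value_counts()  # counts pooled over all stacked columns; NaN is not counted
    m = pd.Series(replacement, c.index[c <= threshold])  # rare value -> replacement
    return s.replace(m).unstack()  # back to wide format

cols = list('AB')
# drop(columns=...) replaces the positional axis argument, which newer pandas rejects
replace_low_freq(df[cols]).join(df.drop(columns=cols))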

       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3   None    cat  emu
4    ant  other  emu
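
One thing worth checking before reusing this helper: because the columns are stacked before counting, frequencies are pooled over all selected columns rather than computed per column. In the example no value appears in both A and B, so the result is the same. A small sketch with a made-up frame (toy is hypothetical data, not from the question) shows where the two ways of counting differ:

```python
import pandas as pd

# Hypothetical data in which 'x' and 'y' appear in both columns
toy = pd.DataFrame({'A': ['x', 'x', 'y'],
                    'B': ['x', 'y', 'y']})

# Pooled counts, as computed inside the stack-based helper
pooled = toy.stack().value_counts()

# Per-column counts, as computed by the apply/mask approach
per_column_a = toy['A'].value_counts()

print(pooled)
print(per_column_a)
```

Here 'x' occurs three times overall but only twice in column A, so a threshold can give different results depending on which count is used.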

Tags:

python-3.x

pandas

Imu
