
Replace low frequency categorical values from pandas.dataframe while ignoring NaNs

How can I replace the values from certain columns in a pandas.DataFrame that occur rarely, i.e. with low frequency (while ignoring NaNs)?

For example, in the following dataframe, suppose I wanted to replace any values in columns A or B that occur less than three times in their respective column. I want to replace these with "other":

import pandas as pd
import numpy as np

df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog',np.nan, 'emu', 'emu']})
df
   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | peach | dog |
cherry | cat   | NaN |
NaN    | cat   | emu |
ant    | peach | emu |

In other words, in columns A and B, I want to replace those values that occur twice or less (but leave NaNs alone).

So the output I want is:

   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | other | dog |
other  | cat   | NaN |
NaN    | cat   | emu |
ant    | other | emu |

This is related to a previously posted question: Remove low frequency values from pandas.dataframe

but the solution there raised "AttributeError: 'NoneType' object has no attribute 'any'" (I think because I have NaN values?).

Imu asked Jan 10 '17 20:01


2 Answers

This is pretty similar to Change values in pandas dataframe according to value_counts(). You can add a condition to the lambda function to exclude column 'C' as follows:

df.apply(lambda x: x.mask(x.map(x.value_counts())<3, 'other') if x.name!='C' else x)
Out: 
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu

This iterates over the columns. For each column, it computes the value counts and maps every value to its count, so x.mask can test whether that count is smaller than 3. Where it is, the value is replaced with 'other'; otherwise the original value is kept. NaN entries have no count, so the comparison is False for them and they are left unchanged. Lastly, the condition on x.name returns column 'C' untouched.
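To make the individual steps visible, here is a minimal sketch that applies the same logic to column A alone. The intermediate names (counts, per_row, is_rare) are only for illustration and do not appear in the one-liner above.

```python
import numpy as np
import pandas as pd

# Column A from the question, including one NaN.
a = pd.Series(['ant', 'ant', 'cherry', np.nan, 'ant'], name='A')

counts = a.value_counts()          # NaN is excluded by default: ant -> 3, cherry -> 1
per_row = a.map(counts)            # frequency of each row's value; the NaN row maps to NaN
is_rare = per_row < 3              # NaN < 3 is False, so the NaN row is never flagged
result = a.mask(is_rare, 'other')  # replace only the flagged rows

print(result.tolist())
```

Only 'cherry' is flagged, so it is the only entry that becomes 'other'; the NaN passes through untouched.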

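The lambda's condition can be generalized to several columns by changing x.name!='C' to x.name not in ['C', 'D', 'E', 'F']. The shorter x.name not in 'CDEF' also works here, but only because every column name is a single character: on a string, in tests for substrings, so a column named 'CD' would be excluded as well.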
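If it is more convenient to list the columns to process rather than the ones to skip, the same idea can be wrapped in a small function. This is a sketch, not part of the original answer: replace_rare, columns and min_count are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

def replace_rare(df, columns, min_count=3, replacement='other'):
    """Return a copy of df in which, for each listed column, values that
    occur fewer than min_count times are replaced. NaNs are left as-is."""
    out = df.copy()
    for col in columns:
        # Frequency of each row's value within its own column (NaN rows map to NaN).
        freq = out[col].map(out[col].value_counts())
        out[col] = out[col].mask(freq < min_count, replacement)
    return out

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu']})

result = replace_rare(df, ['A', 'B'])
print(result)
```

Column C is not in the list, so it is returned unchanged even though none of its values occurs three times.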

ayhan answered Sep 28 '22 00:09


Using a helper function and replace:

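def replace_low_freq(df, threshold=2, replacement='other'):
    # Flatten the selected columns into one Series; counts are pooled across columns
    s = df.stack()
    c = s.value_counts()
    # Map every value seen `threshold` times or fewer to the replacement
    m = pd.Series(replacement, c.index[c <= threshold])
    return s.replace(m).unstack()

cols = list('AB')
# Pass the columns by keyword; recent pandas no longer accepts df.drop(cols, 1)
replace_low_freq(df[cols]).join(df.drop(columns=cols))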

       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3   None    cat  emu
4    ant  other  emu
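
One thing to keep in mind with the stacked approach: value_counts runs on the flattened Series, so a value's count is its total across all selected columns, not its count within one column. In the question's data no value appears in both A and B, so the result is the same. The sketch below uses a made-up frame (demo is a hypothetical example, not data from the question) to show where the two ways of counting differ.

```python
import pandas as pd

# Hypothetical data in which 'x' appears in both columns.
demo = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['x', 'z', 'z']})

pooled = demo.stack().value_counts()   # counts over A and B together
per_col_a = demo['A'].value_counts()   # counts within column A only

print(pooled['x'], per_col_a['x'])
```

With a threshold of 2, 'x' in column A would be kept under pooled counting (three occurrences overall) but replaced under per-column counting (two occurrences in A).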

piRSquared answered Sep 28 '22 02:09