There is a great solution to this in R. My df.column looks like:
Windows
Windows
Mac
Mac
Mac
Linux
Windows
...
I want to replace the low-frequency categories with 'Other' in this df.column vector. For example, I need my df.column to look like:
Windows
Windows
Mac
Mac
Mac
Linux -> Other
Windows
...
I would like to rename these rare categories to reduce the number of factors in my regression, which is why I need the original vector. In Python, I get the frequency table with:
pd.value_counts(df.column)
Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Chrome OS 347
Linux 285
Windows Phone 167
(not set) 22
BlackBerry 11
I wonder if there is a method to rename 'Chrome OS' and 'Linux' (the low-frequency categories) into another category (for example, 'Other'), and to do so in an efficient way.
Mask by finding the percentage of occurrence, i.e.:
series = pd.value_counts(df.column)
mask = (series/series.sum() * 100).lt(1)
# To replace values in df['column'], use np.where, i.e.:
df['column'] = np.where(df['column'].isin(series[mask].index),'Other',df['column'])
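Putting those two steps together, here is a runnable sketch on toy data (the values and the 20% threshold are made up for illustration; the question's data would use a 1% threshold):

```python
import numpy as np
import pandas as pd

# Toy data standing in for df['column'] (the real counts are much larger).
df = pd.DataFrame({'column': ['Windows'] * 5 + ['Mac'] * 3 + ['Linux']})

# Count each category and mask those below a 20% share.
series = df['column'].value_counts()
mask = (series / series.sum() * 100).lt(20)

# Replace rare categories in the original column with 'Other'.
df['column'] = np.where(df['column'].isin(series[mask].index), 'Other', df['column'])

# Linux (1 of 9, about 11%) falls below the threshold and becomes 'Other'.
print(df['column'].tolist())
```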
To build a new frequency table where the rare counts are summed into a single 'Other' entry:
new = series[~mask]
new['Other'] = series[mask].sum()
Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Other 832
Name: 1, dtype: int64
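As a self-contained sketch of those two lines (the frequency numbers are copied from the question's table):

```python
import pandas as pd

# Hypothetical frequency table matching the question.
series = pd.Series({'Windows': 26083, 'iOS': 19711, 'Android': 13077,
                    'Macintosh': 5799, 'Chrome OS': 347, 'Linux': 285,
                    'Windows Phone': 167, '(not set)': 22, 'BlackBerry': 11})
mask = (series / series.sum() * 100).lt(1)

# Keep the frequent categories; .copy() avoids mutating a view.
new = series[~mask].copy()

# Add one 'Other' row holding the summed rare counts.
new['Other'] = series[mask].sum()
print(new['Other'])  # 832
```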
If you want to replace the index labels instead:
series.index = np.where(series.index.isin(series[mask].index),'Other',series.index)
Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Other 347
Other 285
Other 167
Other 22
Other 11
Name: 1, dtype: int64
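Renaming the index this way leaves several duplicate 'Other' rows, as shown above. This follow-up is not in the original answer, but one way to merge those duplicates is a groupby on the index:

```python
import numpy as np
import pandas as pd

series = pd.Series({'Windows': 26083, 'iOS': 19711, 'Android': 13077,
                    'Macintosh': 5799, 'Chrome OS': 347, 'Linux': 285,
                    'Windows Phone': 167, '(not set)': 22, 'BlackBerry': 11})
mask = (series / series.sum() * 100).lt(1)

# Rename rare index labels to 'Other' (leaves duplicate 'Other' rows).
series.index = np.where(series.index.isin(series[mask].index), 'Other', series.index)

# Collapse the duplicate 'Other' rows into one summed row.
collapsed = series.groupby(level=0).sum()
print(collapsed['Other'])  # 832
```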
Explanation
(series/series.sum() * 100) # This will give you the percentage i.e
Windows 39.820158
iOS 30.092211
Android 19.964276
Macintosh 8.853165
Chrome OS 0.529755
Linux 0.435101
Windows Phone 0.254954
(not set) 0.033587
BlackBerry 0.016793
Name: 1, dtype: float64
.lt(1)
is equivalent to "less than 1". That gives you a boolean mask; based on that mask, index the series and assign the data.
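The .lt(1) step can be seen in isolation with a tiny example (the percentages here are abbreviated from the table above):

```python
import pandas as pd

# Percentage shares, abbreviated from the frequency table above.
pct = pd.Series({'Windows': 39.8, 'Linux': 0.44, 'BlackBerry': 0.02})

# Element-wise "less than 1": True marks the rare categories.
mask = pct.lt(1)
print(mask.tolist())  # [False, True, True]
```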
This is a (late) extension to the question; it applies the rationale of combining low-frequency categories (those with a proportion less than min_freq
) to the columns of an entire dataframe. It is based on @Bharath's answer.
def condense_category(col, min_freq=0.01, new_name='other'):
series = pd.value_counts(col)
mask = (series/series.sum()).lt(min_freq)
return pd.Series(np.where(col.isin(series[mask].index), new_name, col))
A simple example of application:
df_toy = pd.DataFrame({'x': [1, 2, 3, 4] + [5]*100, 'y': [5, 6, 7, 8] + [0]*100})
df_toy = df_toy.apply(condense_category, axis=0)
print(df_toy)
# x y
# 0 other other
# 1 other other
# 2 other other
# 3 other other
# 4 5 0
# .. ... ...
# 99 5 0
# 100 5 0
# 101 5 0
# 102 5 0
# 103 5 0
#
# [104 rows x 2 columns]
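The min_freq threshold can be tuned per call, since DataFrame.apply forwards keyword arguments to the function. As a sketch on the same toy frame: each singleton value is 1/104 (about 0.96%) of its column, so lowering the threshold below that keeps everything:

```python
import numpy as np
import pandas as pd

def condense_category(col, min_freq=0.01, new_name='other'):
    series = col.value_counts()
    mask = (series / series.sum()).lt(min_freq)
    return pd.Series(np.where(col.isin(series[mask].index), new_name, col))

df_toy = pd.DataFrame({'x': [1, 2, 3, 4] + [5] * 100, 'y': [5, 6, 7, 8] + [0] * 100})

# With a 0.5% threshold, the singletons (~0.96% each) survive uncondensed.
kept = df_toy.apply(condense_category, min_freq=0.005, axis=0)
print(kept['x'].nunique())  # 5

# With the 1% default, they are all folded into 'other'.
condensed = df_toy.apply(condense_category, axis=0)
print(condensed['x'].nunique())  # 2
```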