I'm doing one-hot encoding on a categorical column that has about 18 distinct values. I want to create new columns only for the values that appear more often than some threshold (say 1%), and add another column named other values
that is 1 whenever the value is not one of those frequent values.
I'm using Pandas with scikit-learn. I've explored pandas get_dummies and scikit-learn's OneHotEncoder, but I can't figure out how to bundle the less frequent values together into one column.
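For reference, recent scikit-learn releases can do this grouping natively: OneHotEncoder accepts a min_frequency argument that folds rare categories into a single "infrequent" column. A minimal sketch, assuming scikit-learn 1.2+ (the column name and data below are illustrative only):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'cat': ['a', 'b', 'b', 'c', 'c', 'c']})

# categories covering less than 20% of the rows are grouped together
enc = OneHotEncoder(min_frequency=0.2, sparse_output=False)
encoded = enc.fit_transform(df[['cat']])

print(enc.get_feature_names_out())
# ['cat_b' 'cat_c' 'cat_infrequent_sklearn']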
One-hot encoding turns a categorical feature into numeric columns that models can work with directly, and the result rescales easily. It is also commonly applied to categorical output labels, where a vector of indicator values is often more convenient than a single class label.
One-hot encoding is suited to nominal features, i.e. categories with no inherent order. It should be avoided when the feature is ordinal (for data like Junior, Senior, Executive, Owner) or when the number of distinct categories is very large, since the resulting explosion of columns can lead to high memory consumption.
One-hot encoding is the process of creating dummy variables: for every unique value in the categorical feature, a new binary column is added that is 1 for rows holding that value and 0 otherwise.
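As a quick illustration of what those dummy variables look like, here is a tiny sketch with pandas (the values are made up for the example):

import pandas as pd

colors = pd.Series(['red', 'green', 'red', 'blue'])
pd.get_dummies(colors, dtype=int)
#    blue  green  red
# 0     0      0    1
# 1     0      1    0
# 2     0      0    1
# 3     1      0    0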
Plan:
- pd.get_dummies to one hot encode as normal
- pd.value_counts with the parameter normalize=True to get the percentage of occurrence of each value
- compare against the threshold to identify the columns that get aggregated
- join the aggregated 'other' column back onto the frequent columns
import numpy as np
import pandas as pd

def hot_mess(s, thresh):
    # one hot encode as normal
    d = pd.get_dummies(s)
    # True for values whose share of rows falls below the threshold
    f = pd.value_counts(s, sort=False, normalize=True) < thresh
    if f.sum() == 0:
        return d
    else:
        # keep the frequent columns, collapse the rest into a single 'other' column
        return d.loc[:, ~f].join(d.loc[:, f].sum(1).rename('other'))
Consider the pd.Series s:

s = pd.Series(np.repeat(list('abcdef'), range(1, 7)))
s
0 a
1 b
2 b
3 c
4 c
5 c
6 d
7 d
8 d
9 d
10 e
11 e
12 e
13 e
14 e
15 f
16 f
17 f
18 f
19 f
20 f
dtype: object
hot_mess(s, 0)
a b c d e f
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 0 0 0
4 0 0 1 0 0 0
5 0 0 1 0 0 0
6 0 0 0 1 0 0
7 0 0 0 1 0 0
8 0 0 0 1 0 0
9 0 0 0 1 0 0
10 0 0 0 0 1 0
11 0 0 0 0 1 0
12 0 0 0 0 1 0
13 0 0 0 0 1 0
14 0 0 0 0 1 0
15 0 0 0 0 0 1
16 0 0 0 0 0 1
17 0 0 0 0 0 1
18 0 0 0 0 0 1
19 0 0 0 0 0 1
20 0 0 0 0 0 1
hot_mess(s, .1)
c d e f other
0 0 0 0 0 1
1 0 0 0 0 1
2 0 0 0 0 1
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 0 1 0 0 0
7 0 1 0 0 0
8 0 1 0 0 0
9 0 1 0 0 0
10 0 0 1 0 0
11 0 0 1 0 0
12 0 0 1 0 0
13 0 0 1 0 0
14 0 0 1 0 0
15 0 0 0 1 0
16 0 0 0 1 0
17 0 0 0 1 0
18 0 0 0 1 0
19 0 0 0 1 0
20 0 0 0 1 0
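Here a accounts for 1/21 ≈ 4.8% of the rows and b for 2/21 ≈ 9.5%, both below the 10% threshold, so they are folded into the other column while c, d, e and f keep their own columns. You can check the proportions that drive that decision directly (output shown as comments):

pd.value_counts(s, normalize=True)
# f    0.285714
# e    0.238095
# d    0.190476
# c    0.142857
# b    0.095238
# a    0.047619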