I'm doing one-hot encoding on a categorical column that has about 18 distinct values. I want to create new columns only for the values that appear more often than some threshold (say 1%), and add another column named other values
that is 1 whenever the value is not one of those frequent values.
I'm using Pandas with scikit-learn. I've explored pandas get_dummies and scikit-learn's OneHotEncoder, but I can't figure out how to bundle the less frequent values together into one column.
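For reference, recent scikit-learn releases can do this grouping natively: OneHotEncoder accepts a min_frequency argument that folds rare categories into a single "infrequent" column. A minimal sketch, assuming scikit-learn 1.2+ (the column name and data below are illustrative only):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'cat': ['a', 'b', 'b', 'c', 'c', 'c']})

# categories covering less than 20% of the rows are grouped together
enc = OneHotEncoder(min_frequency=0.2, sparse_output=False)
encoded = enc.fit_transform(df[['cat']])

print(enc.get_feature_names_out())
# ['cat_b' 'cat_c' 'cat_infrequent_sklearn']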
One-hot encoding turns a categorical feature into numeric columns that models can work with directly, and the result rescales easily. It is also commonly applied to categorical output labels, where a vector of indicator values is often more convenient than a single class label.
One-hot encoding is suited to nominal features, i.e. categories with no inherent order. It should be avoided when the feature is ordinal (for data like Junior, Senior, Executive, Owner) or when the number of distinct categories is very large, since the resulting explosion of columns can lead to high memory consumption.
One-hot encoding is the process of creating dummy variables: for every unique value in the categorical feature, a new binary column is added that is 1 for rows holding that value and 0 otherwise.
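As a quick illustration of what those dummy variables look like, here is a tiny sketch with pandas (the values are made up for the example):

import pandas as pd

colors = pd.Series(['red', 'green', 'red', 'blue'])
pd.get_dummies(colors, dtype=int)
#    blue  green  red
# 0     0      0    1
# 1     0      1    0
# 2     0      0    1
# 3     1      0    0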
Plan:
- pd.get_dummies to one hot encode as normal
- pd.value_counts with the parameter normalize=True to get the percentage of occurrence of each value
- compare against the threshold to identify the columns that get aggregated
- join the aggregated 'other' column back onto the frequent columns
import numpy as np
import pandas as pd

def hot_mess(s, thresh):
    # one hot encode as normal
    d = pd.get_dummies(s)
    # True for values whose share of rows falls below the threshold
    f = pd.value_counts(s, sort=False, normalize=True) < thresh
    if f.sum() == 0:
        return d
    else:
        # keep the frequent columns, collapse the rest into a single 'other' column
        return d.loc[:, ~f].join(d.loc[:, f].sum(1).rename('other'))
Consider the pd.Series s:

s = pd.Series(np.repeat(list('abcdef'), range(1, 7)))
s
0 a
1 b
2 b
3 c
4 c
5 c
6 d
7 d
8 d
9 d
10 e
11 e
12 e
13 e
14 e
15 f
16 f
17 f
18 f
19 f
20 f
dtype: object
hot_mess(s, 0)
a b c d e f
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 0 0 0
4 0 0 1 0 0 0
5 0 0 1 0 0 0
6 0 0 0 1 0 0
7 0 0 0 1 0 0
8 0 0 0 1 0 0
9 0 0 0 1 0 0
10 0 0 0 0 1 0
11 0 0 0 0 1 0
12 0 0 0 0 1 0
13 0 0 0 0 1 0
14 0 0 0 0 1 0
15 0 0 0 0 0 1
16 0 0 0 0 0 1
17 0 0 0 0 0 1
18 0 0 0 0 0 1
19 0 0 0 0 0 1
20 0 0 0 0 0 1
hot_mess(s, .1)
c d e f other
0 0 0 0 0 1
1 0 0 0 0 1
2 0 0 0 0 1
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 0 1 0 0 0
7 0 1 0 0 0
8 0 1 0 0 0
9 0 1 0 0 0
10 0 0 1 0 0
11 0 0 1 0 0
12 0 0 1 0 0
13 0 0 1 0 0
14 0 0 1 0 0
15 0 0 0 1 0
16 0 0 0 1 0
17 0 0 0 1 0
18 0 0 0 1 0
19 0 0 0 1 0
20 0 0 0 1 0
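Here a accounts for 1/21 ≈ 4.8% of the rows and b for 2/21 ≈ 9.5%, both below the 10% threshold, so they are folded into the other column while c, d, e and f keep their own columns. You can check the proportions that drive that decision directly (output shown as comments):

pd.value_counts(s, normalize=True)
# f    0.285714
# e    0.238095
# d    0.190476
# c    0.142857
# b    0.095238
# a    0.047619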