
Get subset of most frequent dummy variables in pandas

Tags:

python

pandas

I am trying to perform some linear regression analysis. I have some categorical features that I convert to dummy variables using the super awesome get_dummies.

The issue I face is that the dataframe gets too big when I add dummy columns for every level of the categories.

Is there a way (using get_dummies or a more elaborate method) to create dummy variables only for the most frequent terms instead of all of them?

asked Aug 02 '13 by Manuel G

People also ask

What's the use of the pandas get_dummies() method?

get_dummies() is used for data manipulation. It converts categorical data into dummy or indicator variables.
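For instance, a minimal made-up example (the column and values are just for illustration):

import pandas as pd

s = pd.Series(['red', 'green', 'red'], name='color')
print(pd.get_dummies(s))
# one indicator column per level ('green' and 'red'), one row per observation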

What does drop_first do in get_dummies?

The drop_first parameter specifies whether or not you want to drop the first category of the categorical variable you're encoding. By default it is set to drop_first=False, which makes get_dummies create one dummy variable for every level of the input categorical variable.
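A quick sketch of the effect (the data here is made up):

import pandas as pd

s = pd.Series(['a', 'b', 'c'], name='col')

pd.get_dummies(s)                    # columns: a, b, c  (one dummy per level)
pd.get_dummies(s, drop_first=True)   # columns: b, c     (the first level is dropped)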

How can we handle the dummy variable trap?

To overcome the dummy variable trap, we drop one of the columns created when the categorical variables were converted to dummy variables by one-hot encoding. This can be done because the dummy variables include redundant information.
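The redundancy is easy to see in a small sketch (the example data is mine): the dropped dummy can always be reconstructed from the remaining ones.

import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], name='col')
dummies = pd.get_dummies(s).astype(int)

# 'a' is 1 exactly when neither 'b' nor 'c' is 1, so it carries no extra information
reconstructed_a = 1 - dummies[['b', 'c']].sum(axis=1)
print((reconstructed_a == dummies['a']).all())   # True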

What is the difference between OneHotEncoder and get_dummies?

get_dummies can't natively handle categories that are unknown at transformation time; you have to apply workarounds, which is inefficient. OneHotEncoder, on the other hand, handles unknown categories natively.
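For example, a sketch assuming scikit-learn is installed (data is made up):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'col': ['a', 'b', 'c']})
test = pd.DataFrame({'col': ['a', 'd']})   # 'd' never appeared during fit

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)
print(enc.transform(test).toarray())
# [[1. 0. 0.]
#  [0. 0. 0.]]   <- the unknown category 'd' becomes an all-zero row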


1 Answer

I used the answer that @HYRY gave to write a function with a threshold parameter that can be used to separate the popular values from the unpopular ones (which are combined into an 'others' column).

import pandas as pd
import numpy as np

# returns a dummified DataFrame containing only the significant dummies of a given column
def dum_sign(dummy_col, threshold=0.1):

    # work on a copy so the original Series is not modified
    dummy_col = dummy_col.copy()

    # share of each value in the whole column
    count = dummy_col.value_counts() / len(dummy_col)

    # mark the values whose share is higher than the threshold
    mask = dummy_col.isin(count[count > threshold].index)

    # replace the values whose share is lower than the threshold with a single 'others' label
    dummy_col[~mask] = "others"

    return pd.get_dummies(dummy_col, prefix=dummy_col.name)

Let's create some data:

df = ['a', 'a', np.nan, np.nan, 'a', np.nan, 'a', 'b', 'b', 'b', 'b', 'b',
      'c', 'c', 'd', 'e', 'g', 'g', 'g', 'g']

data = pd.Series(df, name='dums')

Examples of use:

 In: dum_sign(data)
Out:
    dums_a  dums_b  dums_g  dums_others
0        1       0       0            0
1        1       0       0            0
2        0       0       0            1
3        0       0       0            1
4        1       0       0            0
5        0       0       0            1
6        1       0       0            0
7        0       1       0            0
8        0       1       0            0
9        0       1       0            0
10       0       1       0            0
11       0       1       0            0
12       0       0       0            1
13       0       0       0            1
14       0       0       0            1
15       0       0       0            1
16       0       0       1            0
17       0       0       1            0
18       0       0       1            0
19       0       0       1            0

 In: dum_sign(data, threshold=0.2)
Out: 
    dums_b  dums_others
0        0            1
1        0            1
2        0            1
3        0            1
4        0            1
5        0            1
6        0            1
7        1            0
8        1            0
9        1            0
10       1            0
11       1            0
12       0            1
13       0            1
14       0            1
15       0            1
16       0            1
17       0            1
18       0            1
19       0            1

 In: dum_sign(data, threshold=0)
Out: 
    dums_a  dums_b  dums_c  dums_d  dums_e  dums_g  dums_others
0        1       0       0       0       0       0            0
1        1       0       0       0       0       0            0
2        0       0       0       0       0       0            1
3        0       0       0       0       0       0            1
4        1       0       0       0       0       0            0
5        0       0       0       0       0       0            1
6        1       0       0       0       0       0            0
7        0       1       0       0       0       0            0
8        0       1       0       0       0       0            0
9        0       1       0       0       0       0            0
10       0       1       0       0       0       0            0
11       0       1       0       0       0       0            0
12       0       0       1       0       0       0            0
13       0       0       1       0       0       0            0
14       0       0       0       1       0       0            0
15       0       0       0       0       1       0            0
16       0       0       0       0       0       1            0
17       0       0       0       0       0       1            0
18       0       0       0       0       0       1            0
19       0       0       0       0       0       1            0

Any suggestions on how to handle NaNs? I believe that NaNs should not be treated as 'others'.
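One possible tweak (just my sketch, not part of the tested function above): relabel only the rare non-missing values and let get_dummies deal with the NaNs itself through its dummy_na flag: with dummy_na=False the NaN rows become all zeros, with dummy_na=True they get their own column.

def dum_sign_na(dummy_col, threshold=0.1, dummy_na=False):
    dummy_col = dummy_col.copy()

    # value shares computed over the non-missing entries only
    count = dummy_col.value_counts() / dummy_col.notna().sum()

    # relabel only rare *non-missing* values; NaNs are left as NaN
    rare = dummy_col.notna() & ~dummy_col.isin(count[count > threshold].index)
    dummy_col[rare] = "others"

    # dummy_na=False -> NaN rows are all zeros; dummy_na=True -> a separate NaN column
    return pd.get_dummies(dummy_col, prefix=dummy_col.name, dummy_na=dummy_na)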

UPD: I have tested it on a pretty large dataset (5 million observations) with 183 distinct strings in the column I wanted to dummify. The implementation takes 10 seconds at most on my laptop.
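If you literally want the N most frequent levels rather than a share threshold, the same idea works with value_counts().nlargest; here is a short variant of mine (not part of the original function):

def dum_top_n(dummy_col, n=3):
    dummy_col = dummy_col.copy()

    # keep only the n most frequent values, lump everything else into 'others'
    top = dummy_col.value_counts().nlargest(n).index
    dummy_col[~dummy_col.isin(top)] = "others"

    return pd.get_dummies(dummy_col, prefix=dummy_col.name)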

answered Sep 26 '22 by Vladimir Iashin