I am trying to perform some linear regression analysis. I have some categorical features that I convert to dummy variables with the super awesome get_dummies.
The issue I face is that the DataFrame gets too big when I add dummy columns for every level of every category.
Is there a way (using get_dummies or a more elaborate method) to create dummy variables only for the most frequent values instead of all of them?
get_dummies() converts categorical data into dummy (indicator) variables.
The drop_first parameter specifies whether to drop the first level of the categorical variable you're encoding. It defaults to drop_first=False, so get_dummies creates one dummy variable for every level of the input categorical variable.
To avoid the dummy variable trap, we drop one of the columns created when the categorical variables are converted to dummy variables by one-hot encoding. This is safe because the dummy variables carry redundant information: the dropped level is implied whenever all the remaining dummies are zero.
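For illustration, here is a minimal sketch of drop_first (the "color" Series and its values are made up for this example, not taken from the question):

import pandas as pd

s = pd.Series(["red", "green", "blue", "green"], name="color")

# one dummy column per level (default behaviour)
print(pd.get_dummies(s, prefix=s.name))

# first level dropped: the omitted level is implied by an all-zero row
print(pd.get_dummies(s, prefix=s.name, drop_first=True))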
(1) get_dummies can't natively handle unknown categories (values that only show up at transform/prediction time); you have to apply workarounds yourself, which is not efficient. scikit-learn's OneHotEncoder, on the other hand, handles unknown categories natively.
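A rough sketch of that, assuming a reasonably recent scikit-learn is available (the "city" column and both frames are invented purely for illustration):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical training and scoring frames, for illustration only
train = pd.DataFrame({"city": ["NY", "NY", "LA", "SF", "SF", "SF"]})
test = pd.DataFrame({"city": ["SF", "Boston"]})   # "Boston" never appears in train

# handle_unknown="ignore": unseen categories at transform time are encoded
# as all zeros instead of raising an error
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["city"]])

print(enc.get_feature_names_out())
print(enc.transform(test[["city"]]).toarray())

# In scikit-learn >= 1.1, OneHotEncoder also accepts min_frequency / max_categories,
# which group rare levels into an "infrequent" bucket, much like the 'others'
# column produced by the function below.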
I used the answer that @HYRY gave to write a function with a threshold parameter that separates the popular values from the unpopular ones (the latter get combined into an 'others' column).
import pandas as pd
import numpy as np
# func that returns a dummified DataFrame of the significant dummies in a given column;
# everything at or below the threshold is collapsed into an 'others' column
def dum_sign(dummy_col, threshold=0.1):
    # work on a copy so the original Series is not modified
    dummy_col = dummy_col.copy()

    # ratio of each value in the whole column (NaN counts toward the denominator)
    count = dummy_col.value_counts() / len(dummy_col)

    # mark values whose ratio is higher than the threshold
    # (NaN is never in count.index, so it also fails this test)
    mask = dummy_col.isin(count[count > threshold].index)

    # replace values whose ratio is at or below the threshold by a special name
    dummy_col[~mask] = "others"

    return pd.get_dummies(dummy_col, prefix=dummy_col.name)
Let's create some data:
df = ['a', 'a', np.nan, np.nan, 'a', np.nan, 'a', 'b', 'b', 'b', 'b', 'b',
'c', 'c', 'd', 'e', 'g', 'g', 'g', 'g']
data = pd.Series(df, name='dums')
Examples of use:
In: dum_sign(data)
Out:
dums_a dums_b dums_g dums_others
0 1 0 0 0
1 1 0 0 0
2 0 0 0 1
3 0 0 0 1
4 1 0 0 0
5 0 0 0 1
6 1 0 0 0
7 0 1 0 0
8 0 1 0 0
9 0 1 0 0
10 0 1 0 0
11 0 1 0 0
12 0 0 0 1
13 0 0 0 1
14 0 0 0 1
15 0 0 0 1
16 0 0 1 0
17 0 0 1 0
18 0 0 1 0
19 0 0 1 0
In: dum_sign(data, threshold=0.2)
Out:
dums_b dums_others
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
6 0 1
7 1 0
8 1 0
9 1 0
10 1 0
11 1 0
12 0 1
13 0 1
14 0 1
15 0 1
16 0 1
17 0 1
18 0 1
19 0 1
In: dum_sign(data, threshold=0)
Out:
dums_a dums_b dums_c dums_d dums_e dums_g dums_others
0 1 0 0 0 0 0 0
1 1 0 0 0 0 0 0
2 0 0 0 0 0 0 1
3 0 0 0 0 0 0 1
4 1 0 0 0 0 0 0
5 0 0 0 0 0 0 1
6 1 0 0 0 0 0 0
7 0 1 0 0 0 0 0
8 0 1 0 0 0 0 0
9 0 1 0 0 0 0 0
10 0 1 0 0 0 0 0
11 0 1 0 0 0 0 0
12 0 0 1 0 0 0 0
13 0 0 1 0 0 0 0
14 0 0 0 1 0 0 0
15 0 0 0 0 1 0 0
16 0 0 0 0 0 1 0
17 0 0 0 0 0 1 0
18 0 0 0 0 0 1 0
19 0 0 0 0 0 1 0
Any suggestions on how to handle NaNs? I believe that NaNs should not be treated as 'others'.
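One possible tweak (just a sketch, not part of the original function): only relabel non-null values, so NaN stays out of the 'others' bucket. get_dummies skips NaN by default, so those rows end up all zeros; pass dummy_na=True instead if you want an explicit NaN column.

def dum_sign_keep_nan(dummy_col, threshold=0.1):
    # variant of dum_sign that leaves NaN untouched instead of folding it into "others"
    dummy_col = dummy_col.copy()
    count = dummy_col.value_counts() / len(dummy_col)
    mask = dummy_col.isin(count[count > threshold].index)
    # only non-null infrequent values are renamed; NaN rows become all-zero dummies
    dummy_col[~mask & dummy_col.notna()] = "others"
    # pass dummy_na=True here instead if NaN should get its own indicator column
    return pd.get_dummies(dummy_col, prefix=dummy_col.name)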
UPD: I have tested it on a pretty large dataset (5 million observations) with 183 distinct strings in the column I wanted to dummify. The implementation takes 10 seconds at most on my laptop.
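If memory (the "DataFrame gets too big" part of the question) is the main concern, the dummies can also be stored as sparse columns. A small sketch, reusing the data Series from above:

# sparse=True stores the indicator columns as pandas sparse columns,
# which can cut memory use drastically when most entries are zero
sparse_dummies = pd.get_dummies(data, prefix=data.name, sparse=True)
print(sparse_dummies.dtypes)                          # sparse dtype (exact type varies by pandas version)
print(sparse_dummies.memory_usage(deep=True).sum())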