
removing redundant columns when using get_dummies

I have a pandas DataFrame df containing categorical variables.

import pandas

df = pandas.DataFrame(data=[['male', 'blue'], ['female', 'brown'],
                            ['male', 'black']], columns=['gender', 'eyes'])

df
Out[16]: 
   gender   eyes
0    male   blue
1  female  brown
2    male  black

Using the function get_dummies, I get the following DataFrame:

df_dummies = pandas.get_dummies(df)

df_dummies
Out[18]: 
   gender_female  gender_male  eyes_black  eyes_blue  eyes_brown
0              0            1           0          1           0
1              1            0           0          0           1
2              0            1           1          0           0

However, the columns gender_female and gender_male contain the same information, because the original column can only take two values. Is there a (smart) way to keep only one of the 2 columns?

UPDATED

Using

df_dummies = pandas.get_dummies(df, drop_first=True)

would give me

df_dummies
Out[21]: 
   gender_male  eyes_blue  eyes_brown
0            1          1           0
1            0          0           1
2            1          0           0

but I would like to drop a dummy column only for the variables that originally had just 2 possible values, and keep every dummy column for the others.

The desired result is

df_dummies
Out[18]: 
   gender_male  eyes_black  eyes_blue  eyes_brown
0            1           0          1           0
1            0           0          0           1
2            1           1          0           0
gabboshow asked May 04 '18 13:05


People also ask

What does the get_dummies() function in pandas do?

get_dummies() is used for data manipulation. It converts categorical data into dummy or indicator variables.

Why do we use drop_first in get_dummies?

The drop_first parameter specifies whether to drop the first category of each categorical variable you are encoding. By default it is set to drop_first=False, which makes get_dummies create one dummy variable for every level of the input categorical variable.

What is the difference between OneHotEncoder and get_dummies?

get_dummies cannot natively handle categories that were not present when the encoding was first built. You have to align the resulting columns yourself, which is less convenient. scikit-learn's OneHotEncoder, on the other hand, can handle unknown categories natively (for example with handle_unknown='ignore').
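One common workaround for unseen categories with get_dummies is to reindex the new dummies onto the columns produced for the original data. This is a minimal sketch under assumptions: the frames train and new and their contents are made up for illustration.

```python
import pandas as pd

# Hypothetical data: "green" appears only in the new frame.
train = pd.DataFrame({"eyes": ["blue", "brown", "black"]})
new = pd.DataFrame({"eyes": ["blue", "green"]})

train_dummies = pd.get_dummies(train)

# Encode the new data, then force it onto the training columns:
# the unseen "eyes_green" column is dropped, and training columns
# missing from the new data are added and filled with 0.
new_dummies = pd.get_dummies(new).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(new_dummies.columns.tolist())
# ['eyes_black', 'eyes_blue', 'eyes_brown']
```

A row whose category was never seen ends up with 0 in every dummy column, which is similar in spirit to what OneHotEncoder does with handle_unknown='ignore'.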


1 Answer

Alternatively, you can split the DataFrame into the part you want to apply drop_first=True to and the part you don't, then concatenate them together.

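import pandas as pd

# Two-level column (gender): one dummy is enough, so drop the first
df1 = pd.get_dummies(df.iloc[:, 0:1], drop_first=True)
# Remaining columns (eyes): keep every dummy
df2 = pd.get_dummies(df.iloc[:, 1:])

df_dummies = pd.concat([df1, df2], axis=1)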
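
The split above relies on hard-coded column positions. One way to pick the two-level columns automatically is to count the distinct values per column. This is a minimal sketch, assuming df is non-empty and every column is categorical; the helper name get_dummies_drop_binary is hypothetical, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["male", "blue"], ["female", "brown"], ["male", "black"]],
    columns=["gender", "eyes"],
)


def get_dummies_drop_binary(frame):
    """One-hot encode frame, dropping the first dummy only for two-level columns."""
    # A column with exactly two distinct values is fully described by one dummy.
    binary_cols = [col for col in frame.columns if frame[col].nunique() == 2]
    other_cols = [col for col in frame.columns if col not in binary_cols]

    parts = []
    if binary_cols:
        parts.append(pd.get_dummies(frame[binary_cols], drop_first=True))
    if other_cols:
        parts.append(pd.get_dummies(frame[other_cols]))
    # Note: two-level columns come first, so the original order may change.
    return pd.concat(parts, axis=1)


df_dummies = get_dummies_drop_binary(df)
print(df_dummies.columns.tolist())
# ['gender_male', 'eyes_black', 'eyes_blue', 'eyes_brown']
```

Depending on the pandas version, the dummy values are shown as 0/1 or as False/True; calling .astype(int) on the result gives 0/1 in either case.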
David LE answered Sep 29 '22 11:09