I have a pandas DataFrame df containing categorical variables.
df=pandas.DataFrame(data=[['male','blue'],['female','brown'],
['male','black']],columns=['gender','eyes'])
df
Out[16]:
   gender   eyes
0    male   blue
1  female  brown
2    male  black
Using the function get_dummies, I get the following DataFrame:
df_dummies = pandas.get_dummies(df)
df_dummies
Out[18]:
   gender_female  gender_male  eyes_black  eyes_blue  eyes_brown
0              0            1           0          1           0
1              1            0           0          0           1
2              0            1           1          0           0
However, the columns gender_female and gender_male contain the same information, because the original column can take only two values. Is there a (smart) way to keep only one of the two columns?
UPDATED
Using
df_dummies = pandas.get_dummies(df,drop_first=True)
gives me
df_dummies
Out[21]:
   gender_male  eyes_blue  eyes_brown
0            1          1           0
1            0          0           1
2            1          0           0
but I would like to drop a column only for the variables that originally had just 2 possible values.
The desired result is:
df_dummies
Out[18]:
   gender_male  eyes_black  eyes_blue  eyes_brown
0            1           0          1           0
1            0           0          0           1
2            1           1          0           0
get_dummies() converts categorical data into dummy (indicator) variables.
The drop_first parameter specifies whether to drop the first category of each categorical variable being encoded. It defaults to drop_first=False, in which case get_dummies creates one dummy variable for every level of the input categorical variable.
get_dummies cannot natively handle unknown categories, that is, categories that show up at transformation time but were not in the data originally encoded. You have to work around this yourself, which is less convenient. scikit-learn's OneHotEncoder, on the other hand, can handle unknown categories natively through its handle_unknown parameter.
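One common pandas-only workaround for the unknown-category problem is to align the new dummy columns to the columns produced from the original data. The sketch below assumes two small illustrative frames, train and new (both made up for this example), and uses reindex to drop unseen categories and fill missing ones with 0.

```python
import pandas as pd

# Illustrative data: 'green' appears only in the new data.
train = pd.DataFrame({'eyes': ['blue', 'brown', 'black']})
new = pd.DataFrame({'eyes': ['blue', 'green']})

train_dummies = pd.get_dummies(train)

# Align the new data to the training columns: dummy columns for unseen
# categories are dropped, and columns missing from the new data are
# created and filled with 0.
new_dummies = pd.get_dummies(new).reindex(
    columns=train_dummies.columns, fill_value=0
)
print(new_dummies.astype(int))
```

Rows with an unseen category end up with all zeros, which is roughly what OneHotEncoder(handle_unknown='ignore') does in scikit-learn.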
Alternatively, you can split the DataFrame into the columns you want to encode with drop_first=True and the columns you don't, encode each part separately, and then concatenate the results.
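import pandas as pd
# Two-category columns: drop the first (redundant) dummy
df1 = pd.get_dummies(df[['gender']], drop_first=True)
# Columns with more than two categories: keep every dummy
df2 = pd.get_dummies(df[['eyes']])
df_dummies = pd.concat([df1, df2], axis=1)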
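The split-and-concatenate idea can also be automated so the two-category columns do not have to be listed by hand. The following is a minimal sketch, not a pandas feature: get_dummies_drop_binary is a hypothetical helper name, and it assumes the DataFrame has at least one column and that "binary" means exactly two distinct non-missing values.

```python
import pandas as pd

def get_dummies_drop_binary(df):
    """One-hot encode df, dropping the first dummy only for two-category columns.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Columns with exactly two distinct (non-missing) values.
    binary_cols = [c for c in df.columns if df[c].nunique() == 2]
    other_cols = [c for c in df.columns if c not in binary_cols]

    parts = []
    if binary_cols:
        # One dummy is enough to represent a two-category column.
        parts.append(pd.get_dummies(df[binary_cols], drop_first=True))
    if other_cols:
        # Keep every dummy for columns with more than two categories.
        parts.append(pd.get_dummies(df[other_cols]))
    return pd.concat(parts, axis=1)

df = pd.DataFrame(
    data=[['male', 'blue'], ['female', 'brown'], ['male', 'black']],
    columns=['gender', 'eyes'],
)
df_dummies = get_dummies_drop_binary(df)
print(list(df_dummies.columns))
```

For the example data this yields the columns gender_male, eyes_black, eyes_blue and eyes_brown. Note that the binary columns are placed first, so the original column order is not necessarily preserved.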