
Pandas: get_dummies vs categorical

I have a dataset which has a few columns with categorical data.

I've been using the Categorical function to replace categorical values with numerical ones.

data[column] = pd.Categorical(data[column]).codes

I recently ran across the pandas.get_dummies function. Are the two interchangeable? Is there an advantage to using one over the other?

sapo_cosmico asked Mar 23 '15 22:03


1 Answer

Why are you converting the categorical data to integers? If the goal is to save memory, I don't believe it will.

import pandas as pd

# Same six values, stored once as a categorical and once as plain integers
df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})
df2 = pd.DataFrame({'cat': [1, 1, 1, 2, 2, 3]})

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat    6 non-null category
dtypes: category(1)
memory usage: 78.0 bytes

>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat    6 non-null int64
dtypes: int64(1)
memory usage: 96.0 bytes
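
Six rows are too few to show much of a difference. As a rough check, the sketch below repeats the comparison on a larger, made-up column (not the asker's data) and also includes the plain string version. The names labels, as_object, as_category, as_int64 and the helper nbytes are illustrative assumptions, and the exact byte counts will vary with the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: 60,000 labels drawn from three categories.
labels = ["a", "b", "c"] * 20_000

as_object = pd.Series(labels, dtype=object)        # plain Python strings
as_category = as_object.astype("category")         # small integer codes + lookup table
as_int64 = as_category.cat.codes.astype("int64")   # codes widened to 64-bit integers


def nbytes(series):
    """Bytes used by the values only (index excluded, object contents included)."""
    return int(series.memory_usage(index=False, deep=True))


sizes = {
    "object": nbytes(as_object),
    "category": nbytes(as_category),
    "int64": nbytes(as_int64),
}

for name, size in sizes.items():
    print(f"{name:>8}: {size} bytes")
```

On a column like this the categorical should come out smallest, since it stores one small integer per row plus a single copy of each label, which is consistent with the info() output above.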

The categorical codes are just integer values for the unique items in the given category. By contrast, get_dummies returns a new column for each unique item. The value in the column indicates whether or not the record has that attribute.

>>> pd.get_dummies(df, dtype=int)
   cat_a  cat_b  cat_c
0      1      0      0
1      1      0      0
2      1      0      0
3      0      1      0
4      0      1      0
5      0      0      1
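
To make the relationship between the two encodings concrete, here is a small sketch on the same six values. It assumes the column is already a Categorical, so the dummy columns follow the category order. The variable names (codes, dummies, row_sums, recovered) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"cat": pd.Categorical(["a", "a", "a", "b", "b", "c"])})

# One integer per row: the position of the value in the category list.
codes = df["cat"].cat.codes

# One indicator column per category; dtype=int gives 0/1 rather than booleans.
dummies = pd.get_dummies(df["cat"], prefix="cat", dtype=int)

# Every row has exactly one indicator set...
row_sums = dummies.sum(axis=1)

# ...and the position of that indicator is the row's integer code.
recovered = dummies.to_numpy().argmax(axis=1)

print(codes.tolist())         # [0, 0, 0, 1, 1, 2]
print(list(dummies.columns))  # ['cat_a', 'cat_b', 'cat_c']
print(recovered.tolist())     # [0, 0, 0, 1, 1, 2]
```

So the two carry the same information, but they are not interchangeable as model inputs: the codes put the categories on a single numeric scale (a < b < c), while the dummy columns treat each category as its own yes/no feature.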

To get the codes directly, you can use:

df['codes'] = df['cat'].cat.codes
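
A minimal sketch of reading the codes through the .cat accessor and mapping them back to labels, assuming the column has categorical dtype. The names categories, restored and with_missing are illustrative, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"cat": pd.Categorical(["a", "a", "a", "b", "b", "c"])})

# The .cat accessor exposes the integer code behind each value.
df["codes"] = df["cat"].cat.codes

# The lookup table that the codes index into.
categories = df["cat"].cat.categories

# Codes plus categories are enough to rebuild the original labels.
restored = pd.Categorical.from_codes(df["codes"], categories=categories)

print(df["codes"].tolist())  # [0, 0, 0, 1, 1, 2]
print(list(categories))      # ['a', 'b', 'c']
print(list(restored))        # ['a', 'a', 'a', 'b', 'b', 'c']

# Missing values have no category and are given the code -1.
with_missing = pd.Categorical(["a", None, "b"])
print(with_missing.codes.tolist())  # [0, -1, 1]
```

Keeping the categories alongside the codes is what makes the integers interpretable later. The codes alone lose the labels.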
Alexander answered Sep 24 '22 23:09