I have a dataset which has a few columns with categorical data.
I've been using the Categorical function to replace categorical values with numerical ones.
data[column] = pd.Categorical(data[column]).codes
I recently ran across the pandas.get_dummies function. Are these interchangeable? Is there an advantage to using one over the other?
get_dummies. Convert categorical variable into dummy/indicator variables.
The one-hot encoder is a popular feature encoding strategy that behaves similarly to pd.get_dummies(), with some added advantages. It encodes a nominal (categorical) feature by creating one binary column per category. Scikit-learn ships an implementation, OneHotEncoder.
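One advantage worth illustrating is that a fitted encoder remembers the categories it saw, so new data is encoded into the same columns. The sketch below assumes scikit-learn is installed; the train/test frames and the unseen value "d" are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "d" appears only in the test frame
train = pd.DataFrame({"cat": ["a", "a", "b", "c"]})
test = pd.DataFrame({"cat": ["b", "d"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["cat"]])

# transform returns a sparse matrix by default; densify it for inspection
test_encoded = encoder.transform(test[["cat"]]).toarray()
learned_categories = list(encoder.categories_[0])

print(learned_categories)
print(test_encoded)
```

Calling pd.get_dummies on the test frame alone would instead produce cat_b and cat_d columns, which would not line up with the training columns.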
Looking at your problem, get_dummies is the option to go with, as it gives equal weight to every category. LabelEncoder is appropriate when the categorical variable is ordinal: if you are converting a severity or ranking, then encoding "high" as 2 and "low" as 1 makes sense.
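If the order matters, it has to be stated explicitly, because by default pandas (like LabelEncoder) sorts the categories, which for strings means alphabetically. A minimal pandas-only sketch, using a made-up severity column:

```python
import pandas as pd

# Hypothetical ordinal column
severity = pd.Series(["low", "high", "medium", "low"])

# Default: categories are sorted alphabetically (high, low, medium),
# so the codes do not reflect the real ranking
alphabetical_codes = pd.Categorical(severity).codes.tolist()

# Explicit order: low < medium < high
ordered = pd.Categorical(
    severity, categories=["low", "medium", "high"], ordered=True
)
ranked_codes = ordered.codes.tolist()

print(alphabetical_codes)  # [1, 0, 2, 1]
print(ranked_codes)        # [0, 2, 1, 0]
```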
drop_first. The drop_first parameter specifies whether to drop the first category of the categorical variable you're encoding. The default is drop_first=False, which makes get_dummies create one dummy variable for every level of the input categorical variable.
Why are you converting the categorical data to integers? I don't believe you save memory, if that is your goal.
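To see the effect of drop_first, compare the resulting columns on a small made-up frame. The dtype=int argument is only there so the dummies print as 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b", "c"]})

# Default: one dummy column per level
full = pd.get_dummies(df, columns=["cat"], dtype=int)

# drop_first=True: the first level ("a") becomes the implicit baseline
reduced = pd.get_dummies(df, columns=["cat"], drop_first=True, dtype=int)

full_columns = list(full.columns)
reduced_columns = list(reduced.columns)
print(full_columns)     # ['cat_a', 'cat_b', 'cat_c']
print(reduced_columns)  # ['cat_b', 'cat_c']
```

With the first level dropped, a row of all zeros means "a", so no information is lost.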
>>> import pandas as pd
>>> df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})
>>> df2 = pd.DataFrame({'cat': [1, 1, 1, 2, 2, 3]})
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat 6 non-null category
dtypes: category(1)
memory usage: 78.0 bytes
>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat 6 non-null int64
dtypes: int64(1)
memory usage: 96.0 bytes
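The same comparison can be repeated on a larger column, where the gap is easier to see. This is a sketch on randomly generated labels (the 100,000-row size and the three labels are arbitrary); exact byte counts depend on the pandas version, so none are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical larger column: 100,000 rows drawn from three labels
labels = np.random.default_rng(0).choice(["a", "b", "c"], size=100_000)

as_category = pd.Series(labels, dtype="category")
as_int64 = pd.Series(as_category.cat.codes, dtype="int64")

# deep=True also counts the memory of the category labels themselves
category_bytes = as_category.memory_usage(index=False, deep=True)
int64_bytes = as_int64.memory_usage(index=False, deep=True)
print(category_bytes, int64_bytes)
```

The category column stores its codes in the smallest integer type that fits, so it should come out smaller than the int64 column here.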
The categorical codes are just integer values for the unique items in the given category. By contrast, get_dummies returns a new column for each unique item. The value in the column indicates whether or not the record has that attribute.
>>> pd.get_dummies(df, dtype=int)
   cat_a  cat_b  cat_c
0      1      0      0
1      1      0      0
2      1      0      0
3      0      1      0
4      0      1      0
5      0      0      1
To get the codes directly, you can use:
df['codes'] = df['cat'].cat.codes
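If you also need to map the codes back to the original labels, the categories index gives you the lookup. A small sketch using the same frame as above (the name code_to_label is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"cat": pd.Categorical(["a", "a", "a", "b", "b", "c"])})

# The .cat accessor on the column exposes the integer codes
df["codes"] = df["cat"].cat.codes

# Position in .cat.categories is the code for that label
code_to_label = dict(enumerate(df["cat"].cat.categories))

print(df["codes"].tolist())  # [0, 0, 0, 1, 1, 2]
print(code_to_label)         # {0: 'a', 1: 'b', 2: 'c'}
```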