This is the dataset, with 3 columns and 3 rows:
Name   Organization  Department
Manie  ABC2          FINANCE
Joyce  ABC1          HR
Ami    NSV2          HR
This is the code I have so far. It works fine up to this point, but how do I drop the first dummy variable column for each feature?
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data1.csv',encoding = "cp1252")
X = dataset.values
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features = "all")
X = onehotencoder.fit_transform(X).toarray()
Applying OneHotEncoder to only certain columns is possible with ColumnTransformer: create separate pipelines for the categorical and numerical variables and combine them with ColumnTransformer. More information can be found in the ColumnTransformer documentation.
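As a minimal sketch of that idea, the snippet below rebuilds the question's three-row dataset by hand and one-hot encodes only the two categorical columns. It assumes scikit-learn 0.21 or later (for the drop parameter); the transformer label "onehot" is just an illustrative name, and there is no numerical pipeline here because the sample data has no numerical columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The dataset from the question, rebuilt by hand.
df = pd.DataFrame({
    "Name": ["Manie", "Joyce", "Ami"],
    "Organization": ["ABC2", "ABC1", "NSV2"],
    "Department": ["FINANCE", "HR", "HR"],
})

# Encode only Organization and Department; drop="first" removes the first
# category of each feature. Columns not listed (Name) are dropped.
ct = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(drop="first"), ["Organization", "Department"]),
    ],
    remainder="drop",
    sparse_threshold=0,  # always return a dense NumPy array
)
encoded = ct.fit_transform(df)
print(encoded)
```

The resulting columns are Organization=ABC2, Organization=NSV2 and Department=HR, since ABC1 and FINANCE sort first and are the ones dropped.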
One-hot encoding can also be applied to a feature with a large number of categories by restricting it to the most frequent ones. For example, if a feature has 200 categories, only the 10 most frequent categories are one-hot encoded, and the remaining categories get no column of their own.
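One possible way to do this with pandas is sketched below. The helper name one_hot_top_k and the toy color data are hypothetical, chosen only for illustration; rows whose category is outside the top k end up with zeros in every dummy column.

```python
import pandas as pd

def one_hot_top_k(series, k=10):
    """One-hot encode only the k most frequent categories of a Series."""
    top = series.value_counts().nlargest(k).index
    # Values outside the top k become NaN, which get_dummies ignores by
    # default, so those rows are all zeros.
    kept = series.where(series.isin(top))
    return pd.get_dummies(kept, prefix=series.name, dtype=int)

# Toy data: "red" appears 3 times, "green" twice, "blue" once.
colors = pd.Series(["red", "red", "green", "blue", "red", "green"], name="color")
result = one_hot_top_k(colors, k=2)
print(result)
```

With k=2 only color_green and color_red columns are created, and the "blue" row is all zeros.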
A dummy (binary) variable takes the value 0 or 1 to indicate the absence or presence of a category. In one-hot encoding with three colors, “Red” is encoded as the vector [1 0 0] and “Green” as the vector [0 1 0].
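The vectors above can be reproduced with a short pandas sketch. The third color, "Blue", is an assumption made here to fill out the size-3 vector; the category order is fixed explicitly so the columns come out as Red, Green, Blue.

```python
import pandas as pd

# Fix the category order explicitly; "Blue" is an assumed third color.
colors = pd.Series(
    pd.Categorical(["Red", "Green", "Blue"], categories=["Red", "Green", "Blue"])
)

# Full one-hot encoding: one column per category.
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot.values.tolist())   # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Dropping the first dummy column: "Red" is now implied by all zeros.
dropped = pd.get_dummies(colors, drop_first=True, dtype=int)
print(dropped.values.tolist())   # [[0, 0], [1, 0], [0, 1]]
```

No information is lost by dropping the first column, because a row of zeros can only mean the dropped category.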
pd.get_dummies cannot natively handle a category that appears only at transformation time. You have to work around this yourself, which is not efficient. OneHotEncoder, on the other hand, can handle unknown categories natively when it is created with handle_unknown='ignore'.
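The difference can be seen in a small sketch. The "LEGAL" department is a made-up value that appears only in the test frame; note that OneHotEncoder raises an error on unknown categories by default, so handle_unknown="ignore" has to be set explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Department": ["FINANCE", "HR", "HR"]})
test = pd.DataFrame({"Department": ["HR", "LEGAL"]})  # "LEGAL" is unseen

# The encoder remembers the categories seen during fit (FINANCE, HR) and
# encodes an unknown category as a row of zeros.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
encoded_test = enc.transform(test).toarray()
print(encoded_test)

# get_dummies only looks at the frame it is given, so the test columns
# no longer match the ones produced from the training data.
dummies_test = pd.get_dummies(test)
print(list(dummies_test.columns))
```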
From "Think twice before dropping that first one-hot encoded column" (Red Huq, 2019-05-06): Many machine learning models demand that categorical features are converted to a format they can comprehend via a widely used feature engineering technique called one-hot encoding.
Encode the categorical variables one at a time. The dummy variables should go to the beginning index of your data set. Then cut off the first column, and encode and repeat for the next variable.
import pandas as pd
df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
'Org': ['ABC2', 'ABC1', 'NSV2'],
'Dept': ['Finance', 'HR', 'HR']
})
df_2 = pd.get_dummies(df,drop_first=True)
test:
print(df_2)
   Dept_HR  Org_ABC2  Org_NSV2  name_Joyce  name_Manie
0        0         1         0           0           1
1        1         0         0           1           0
2        1         0         1           0           0
UPDATE regarding your error with pd.get_dummies(X, columns=[1:]):
Per the documentation page, the columns parameter takes "Column Names". So the following code would work:
df_2 = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True)
output:
    name  Org_ABC2  Org_NSV2  Dept_HR
0  Manie         1         0        0
1  Joyce         0         0        1
2    Ami         0         1        1
If you really want to define your columns positionally, you could do it this way:
column_names_for_onehot = df.columns[1:]
df_2 = pd.get_dummies(df, columns=column_names_for_onehot, drop_first=True)
I use my own template for doing that:
from sklearn.base import TransformerMixin
import pandas as pd
import numpy as np

class DataFrameEncoder(TransformerMixin):
    def __init__(self):
        """Encode the data.

        Columns of dtype object are collected in a list. Those columns
        are replaced by their dummy columns (first level dropped), and
        the two DataFrames are concatenated again.
        """

    def fit(self, X, y=None):
        # Remember which columns hold object (string) data.
        self.object_col = []
        for col in X.columns:
            if X[col].dtype == np.dtype('O'):
                self.object_col.append(col)
        return self

    def transform(self, X, y=None):
        # Dummy-encode the object columns, dropping the first level of each.
        dummy_df = pd.get_dummies(X[self.object_col], drop_first=True)
        # Remove the original object columns and put the dummies in front.
        X = X.drop(columns=self.object_col)
        X = pd.concat([dummy_df, X], axis=1)
        return X
To use this code, put the template in the current directory in a file named, say, customEncoder.py, and write in your code:
from customEncoder import DataFrameEncoder
data = DataFrameEncoder().fit_transform(data)
All object-type columns are then removed, encoded with the first dummy column dropped, and joined back together to give the desired output.
PS: The input to this template must be a pandas DataFrame.
This is quite simple in scikit-learn starting from version 0.21. You can use the drop parameter of OneHotEncoder to drop one of the categories per feature. By default, nothing is dropped. Details can be found in the documentation.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
# drops the first category in each feature
ohe = OneHotEncoder(drop='first', handle_unknown='error')
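Applied to the two categorical columns of the question's dataset, this could look like the sketch below. The DataFrame is rebuilt by hand, and .toarray() is used because the encoder returns a sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The two categorical columns from the question's dataset.
df = pd.DataFrame({
    "Organization": ["ABC2", "ABC1", "NSV2"],
    "Department": ["FINANCE", "HR", "HR"],
})

# drop="first" removes the first (alphabetically smallest) category of
# every feature: ABC1 for Organization and FINANCE for Department.
ohe = OneHotEncoder(drop="first", handle_unknown="error")
X = ohe.fit_transform(df).toarray()

print(ohe.categories_)  # categories learned per feature, before dropping
print(X)                # columns: ABC2, NSV2, HR
```

The fitted encoder can then be reused with ohe.transform on new data, which is the main practical difference from calling pd.get_dummies on each frame separately.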
I use my own module for dealing with one hot encoding.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class My_encoder(BaseEstimator, TransformerMixin):
    def __init__(self, drop='first', sparse=False):
        # Note: in scikit-learn 1.2 and later this OneHotEncoder argument
        # is named sparse_output instead of sparse.
        self.encoder = OneHotEncoder(drop=drop, sparse=sparse)
        self.features_to_encode = []
        self.columns = []

    def fit(self, X_train, features_to_encode):
        data = X_train.copy()
        self.features_to_encode = features_to_encode
        data_to_encode = data[self.features_to_encode]
        # Borrow the column names from get_dummies so the output is labeled.
        self.columns = pd.get_dummies(data_to_encode, drop_first=True).columns
        self.encoder.fit(data_to_encode)
        return self

    def transform(self, X_test):
        data = X_test.copy()
        data.reset_index(drop=True, inplace=True)
        data_to_encode = data[self.features_to_encode]
        data_left = data.drop(self.features_to_encode, axis=1)
        data_encoded = pd.DataFrame(self.encoder.transform(data_to_encode), columns=self.columns)
        return pd.concat([data_left, data_encoded], axis=1)
It's pretty easy to use:
features_to_encode = [...]  # list of the features to one-hot encode
enc = My_encoder()
enc.fit(X_train,features_to_encode)
X_train = enc.transform(X_train)
X_test = enc.transform(X_test)
It returns a DataFrame with column names, so it covers the disadvantages of both OneHotEncoder and pd.get_dummies(). We can use it to fit and transform, like OneHotEncoder, and it also keeps the column names and returns a DataFrame, like the get_dummies approach.