How to use OneHotEncoder for multiple columns and automatically drop first dummy variable for each column?

This is the dataset, with 3 columns and 3 rows:

Name    Organization  Department
Manie   ABC2          FINANCE
Joyce   ABC1          HR
Ami     NSV2          HR

This is the code I have. It works fine up to this point, but how do I drop the first dummy variable column for each feature?

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data1.csv',encoding = "cp1252")
X = dataset.values


# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

onehotencoder = OneHotEncoder(categorical_features = "all")
X = onehotencoder.fit_transform(X).toarray()
asked Jun 17 '17 06:06 by Vijay



4 Answers

import pandas as pd
df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']        
        })


df_2 = pd.get_dummies(df,drop_first=True)

test:

print(df_2)
   Dept_HR  Org_ABC2  Org_NSV2  name_Joyce  name_Manie
0        0         1         0           0           1
1        1         0         0           1           0
2        1         0         1           0           0 

UPDATE regarding your error with pd.get_dummies(X, columns=[1:]):

Per the documentation page, the columns parameter takes "Column Names". So the following code would work:

df_2 = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True)

output:

    name  Org_ABC2  Org_NSV2  Dept_HR
0  Manie         1         0        0
1  Joyce         0         0        1
2    Ami         0         1        1

If you really want to define your columns positionally, you could do it this way:

column_names_for_onehot = df.columns[1:]
df_2 = pd.get_dummies(df, columns=column_names_for_onehot, drop_first=True)
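
One caveat worth noting: in pandas 2.0 and later, get_dummies returns boolean (True/False) columns by default rather than the 0/1 shown in the output above. A minimal sketch, assuming you want integer columns, is to pass dtype=int alongside the positional column selection:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']})

# Select every column except the first one by position
column_names_for_onehot = list(df.columns[1:])

# dtype=int forces 0/1 output regardless of the pandas version's default
df_2 = pd.get_dummies(df, columns=column_names_for_onehot,
                      drop_first=True, dtype=int)
print(df_2)
```

The columns that are not encoded (here, name) stay at the front, followed by the dummy columns in the order the source columns were listed.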
answered Oct 19 '22 18:10 by Max Power


I use my own template for doing that:

from sklearn.base import TransformerMixin
import pandas as pd
import numpy as np


class DataFrameEncoder(TransformerMixin):
    """One-hot encode every object-dtype column of a DataFrame.

    fit() records the object-dtype columns. transform() replaces them
    with dummy columns (dropping the first level of each) and
    concatenates the result with the remaining columns.
    """

    def fit(self, X, y=None):
        # Remember which columns hold object (string) data
        self.object_col = []
        for col in X.columns:
            if X[col].dtype == np.dtype('O'):
                self.object_col.append(col)
        return self

    def transform(self, X, y=None):
        # Encode the recorded columns, dropping the first dummy of each
        dummy_df = pd.get_dummies(X[self.object_col], drop_first=True)
        # Remove the original object columns, then join the dummies back
        X = X.drop(columns=self.object_col)
        X = pd.concat([dummy_df, X], axis=1)
        return X

To use this code, save the template in the current directory as, say, customEncoder.py, and add the following to your code:

from customEncoder import DataFrameEncoder
data = DataFrameEncoder().fit_transform(data)

All object-type columns are removed, one-hot encoded with the first dummy dropped, and joined back to the remaining columns to give the final output.
PS: The input to this template must be a pandas DataFrame.

answered Oct 19 '22 18:10 by MD Rijwan


This is quite simple in scikit-learn starting from version 0.21: OneHotEncoder has a drop parameter that drops one category per feature. By default, nothing is dropped. Details can be found in the documentation.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

# Drop the first category in each feature
ohe = OneHotEncoder(drop='first', handle_unknown='error')
answered Oct 19 '22 17:10 by Jyoti Prasad Pal


I use my own module for dealing with one hot encoding.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class My_encoder(BaseEstimator, TransformerMixin):

    def __init__(self, drop='first', sparse=False):
        # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`,
        # and `sparse` was removed in 1.4; adjust for your version.
        self.encoder = OneHotEncoder(drop=drop, sparse=sparse)
        self.features_to_encode = []
        self.columns = []

    def fit(self, X_train, features_to_encode):
        data = X_train.copy()
        self.features_to_encode = features_to_encode
        data_to_encode = data[self.features_to_encode]
        # Borrow the column names from get_dummies; they line up with the
        # encoder output because both sort categories and drop the first
        self.columns = pd.get_dummies(data_to_encode, drop_first=True).columns
        self.encoder.fit(data_to_encode)
        return self

    def transform(self, X_test):
        data = X_test.copy()
        data.reset_index(drop=True, inplace=True)
        data_to_encode = data[self.features_to_encode]
        data_left = data.drop(self.features_to_encode, axis=1)

        # Wrap the encoded array in a DataFrame with the saved column names
        data_encoded = pd.DataFrame(self.encoder.transform(data_to_encode),
                                    columns=self.columns)

        return pd.concat([data_left, data_encoded], axis=1)

It's pretty easy to use:

features_to_encode = [...]  # list of features to one-hot encode
enc = My_encoder()
enc.fit(X_train, features_to_encode)
X_train = enc.transform(X_train)
X_test = enc.transform(X_test)

It returns a DataFrame with column names, so it covers the disadvantages of both OneHotEncoder and pd.get_dummies(): we can fit and transform as with OneHotEncoder, while keeping the column names and getting a DataFrame back as with the get_dummies approach.

answered Oct 19 '22 18:10 by Hardik Kamboj