How to use OneHotEncoder for multiple columns and automatically drop first dummy variable for each column?

This is the dataset, with 3 columns and 3 rows:

Name   Organization  Department
Manie  ABC2          FINANCE
Joyce  ABC1          HR
Ami    NSV2          HR

This is the code I have. It works fine up to this point, but how do I drop the first dummy variable column for each feature?

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data1.csv',encoding = "cp1252")
X = dataset.values


# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

onehotencoder = OneHotEncoder(categorical_features = "all")
X = onehotencoder.fit_transform(X).toarray()

asked Jun 17 '17 06:06

4 Answers

import pandas as pd
df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']        
        })


df_2 = pd.get_dummies(df,drop_first=True)

test:

print(df_2)
   Dept_HR  Org_ABC2  Org_NSV2  name_Joyce  name_Manie
0        0         1         0           0           1
1        1         0         0           1           0
2        1         0         1           0           0

UPDATE regarding your error with pd.get_dummies(X, columns=[1:]):

Per the documentation page, the columns parameter takes "Column Names". So the following code would work:

df_2 = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True)

output:

    name  Org_ABC2  Org_NSV2  Dept_HR
0  Manie         1         0        0
1  Joyce         0         0        1
2    Ami         0         1        1

If you really want to define your columns positionally, you could do it this way:

column_names_for_onehot = df.columns[1:]
df_2 = pd.get_dummies(df, columns=column_names_for_onehot, drop_first=True)
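
As a quick check of the positional variant, here is a minimal sketch on the same three-row frame. It assumes a reasonably recent pandas; the `dtype=int` argument is an addition of this sketch (not part of the answer above) so the dummies print as 0/1, since newer pandas releases return boolean columns by default.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']})

# Pick every column except the first by position, then encode only those.
column_names_for_onehot = list(df.columns[1:])

# dtype=int is an assumption of this sketch: it forces 0/1 integers
# instead of the True/False columns newer pandas versions produce.
df_2 = pd.get_dummies(df, columns=column_names_for_onehot,
                      drop_first=True, dtype=int)

print(df_2.columns.tolist())
# ['name', 'Org_ABC2', 'Org_NSV2', 'Dept_HR']
```

The untouched 'name' column stays in front, and one level per encoded column (ABC1 and Finance, the first in sorted order) is dropped.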

answered Oct 19 '22 18:10

Max Power

I use my own template for doing that:

from sklearn.base import TransformerMixin
import pandas as pd
import numpy as np

class DataFrameEncoder(TransformerMixin):
    """Encode the data.

    Columns of dtype object are collected in a list during fit. In
    transform, those columns are replaced by their dummy variables
    (first level dropped) and the two DataFrames are concatenated again.
    """

    def fit(self, X, y=None):
        # Remember which columns hold object (string) data.
        self.object_col = []
        for col in X.columns:
            if X[col].dtype == np.dtype('O'):
                self.object_col.append(col)
        return self

    def transform(self, X, y=None):
        # One-hot encode the object columns, dropping the first level of each.
        dummy_df = pd.get_dummies(X[self.object_col], drop_first=True)
        # Drop the original object columns by name, then rejoin.
        X = X.drop(columns=self.object_col)
        X = pd.concat([dummy_df, X], axis=1)
        return X

To use this code, put the template in the current directory under a filename such as customEncoder.py and type in your code:

from customEncoder import DataFrameEncoder
data = DataFrameEncoder().fit_transform(data)

All the object-type columns are removed, encoded with the first level dropped, and joined back together to give the final desired output.
PS: The input to this template must be a pandas DataFrame.

answered Oct 19 '22 18:10

MD Rijwan

This is quite simple in scikit-learn starting from version 0.21. You can use the drop parameter of OneHotEncoder to drop one of the categories per feature. By default, nothing is dropped. Details can be found in the documentation.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

# drops the first category in each feature
ohe = OneHotEncoder(drop='first', handle_unknown='error')
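
Applied to the three columns from the question, this might look like the sketch below. It assumes scikit-learn 0.21 or later, where OneHotEncoder accepts string columns directly, so the LabelEncoder step is not needed. The DataFrame is typed in by hand here rather than read from Data1.csv.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The question's data, typed in by hand for this sketch.
df = pd.DataFrame({'Name': ['Manie', 'Joyce', 'Ami'],
                   'Organization': ['ABC2', 'ABC1', 'NSV2'],
                   'Department': ['FINANCE', 'HR', 'HR']})

# drop='first' removes the first (sorted) category of every column.
ohe = OneHotEncoder(drop='first', handle_unknown='error')

# fit_transform returns a sparse matrix by default; densify it for display.
X = ohe.fit_transform(df).toarray()

print(ohe.categories_)  # categories learned per column, in sorted order
print(X.shape)          # (3, 5): 2 + 2 + 1 columns after dropping one each
```

Each column loses one indicator: Name keeps Joyce and Manie, Organization keeps ABC2 and NSV2, and Department keeps HR.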

answered Oct 19 '22 17:10

Jyoti Prasad Pal

I use my own module for dealing with one hot encoding.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class My_encoder(BaseEstimator, TransformerMixin):

    def __init__(self, drop='first', sparse=False):
        # Store the constructor arguments so BaseEstimator.get_params() works.
        self.drop = drop
        self.sparse = sparse
        # scikit-learn >= 1.2 names this argument sparse_output;
        # older releases call it sparse.
        self.encoder = OneHotEncoder(drop=drop, sparse_output=sparse)
        self.features_to_encode = []
        self.columns = []

    def fit(self, X_train, features_to_encode):
        data = X_train.copy()
        self.features_to_encode = features_to_encode
        data_to_encode = data[self.features_to_encode]
        # Borrow the output column names from get_dummies.
        self.columns = pd.get_dummies(data_to_encode, drop_first=True).columns
        self.encoder.fit(data_to_encode)
        return self

    def transform(self, X_test):
        data = X_test.copy()
        data.reset_index(drop=True, inplace=True)
        data_to_encode = data[self.features_to_encode]
        data_left = data.drop(self.features_to_encode, axis=1)

        # Encode with the fitted encoder and label the result columns.
        data_encoded = pd.DataFrame(self.encoder.transform(data_to_encode), columns=self.columns)

        return pd.concat([data_left, data_encoded], axis=1)

It's pretty easy to use:

features_to_encode = [...]  # list of features to one hot encode
enc = My_encoder()
enc.fit(X_train,features_to_encode)
X_train = enc.transform(X_train)
X_test = enc.transform(X_test)

It returns a DataFrame with column names, so it covers the disadvantages of both OneHotEncoder and pd.get_dummies(): we can fit and transform with it, like OneHotEncoder, and it also keeps the column names and returns a DataFrame, like the get_dummies approach.

answered Oct 19 '22 18:10

Hardik Kamboj

Tags:

python

pandas

machine-learning

scikit-learn

Vijay

4 Answers

Max Power

MD Rijwan

Jyoti Prasad Pal

Hardik Kamboj
