OneHotEncoder - encoding only some of categorical variable columns

Tags: python, one-hot-encoding, scikit-learn

Let's assume that I have a pandas dataframe with the following column names:

'age' (e.g. 33, 26, 51 etc)
'seniority' (e.g. 'junior', 'senior' etc)
'gender' (e.g. 'male', 'female')
'salary' (e.g. 32000, 40000, 64000 etc)

I want to one hot encode the categorical seniority column. To do this I am running the following:

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data['seniority'] = label_encoder.fit_transform(data['seniority'])

from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder(categorical_features=[1])
data = one_hot_encoder.fit_transform(data.values)

But then I get this error:

ValueError: could not convert string to float: 'gender'

at the line

data = one_hot_encoder.fit_transform(data.values)

However, I have explicitly specified that categorical_features=[1] so only column 1 (seniority) should be considered for this one hot encoding.

How can I fix this error (other than, for example, by dropping the column 'gender')?

I was using pandas.get_dummies in the past and I did not have this problem.


asked Sep 20 '18 18:09

Outcast

1 Answer

I think for this case you should stick to pd.get_dummies:

>>> data
   age seniority  gender  salary
0    1    junior    male       5
1    2    senior  female       6
2    3    junior  female       7

# One hot encode with get_dummies (dtype=int gives 0/1 instead of booleans on newer pandas)
data = pd.concat((data, pd.get_dummies(data.seniority, dtype=int)), axis=1)

>>> data
   age seniority  gender  salary  junior  senior
0    1    junior    male       5       1       0
1    2    senior  female       6       0       1
2    3    junior  female       7       1       0
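
If the original seniority column is not needed afterwards, get_dummies can also be pointed at specific columns through its columns argument, which leaves every other column (including the string-valued gender) untouched. A minimal sketch, assuming the small example frame shown above; the seniority_ prefix on the new columns is the pandas default:

```python
import pandas as pd

# Example frame mirroring the one shown above
data = pd.DataFrame({
    "age": [1, 2, 3],
    "seniority": ["junior", "senior", "junior"],
    "gender": ["male", "female", "female"],
    "salary": [5, 6, 7],
})

# Encode only 'seniority'; 'gender' stays as plain strings.
# dtype=int keeps 0/1 values instead of booleans on newer pandas versions.
encoded = pd.get_dummies(data, columns=["seniority"], dtype=int)

print(encoded.columns.tolist())
```

The encoded column is replaced by one indicator column per category, appended after the columns that were left alone.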

The problem is that, in the scikit-learn version you are using, OneHotEncoder needs a numeric array as input, and it tries to convert the whole array before it looks at categorical_features. The array data.values still contains the string values of gender, so that conversion fails. (Note that categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22.) You could, if you wanted, one hot encode just the seniority values, but if you want to know what the resulting features mean, you have to pass the column names manually, which is impractical in a lot of cases:

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data['seniority'] = label_encoder.fit_transform(data['seniority'])

from sklearn.preprocessing import OneHotEncoder
# Note: scikit-learn 1.2 renamed the sparse parameter to sparse_output
one_hot_encoder = OneHotEncoder(sparse=False)
data[['junior','senior']] = one_hot_encoder.fit_transform(data['seniority'].values.reshape(-1,1))

>>> data
   age  seniority  gender  salary  junior  senior
0    1          0    male       5     1.0     0.0
1    2          1  female       6     0.0     1.0
2    3          0  female       7     1.0     0.0
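
One way to avoid typing the column names by hand is to take them from the fitted LabelEncoder, whose classes_ attribute lists the original labels in the order of their integer codes. This is only a sketch on a standalone seniority series, not part of the code above; it calls .toarray() on the encoder's default sparse output so that it does not depend on the name of the sparse parameter:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

seniority = pd.Series(["junior", "senior", "junior"], name="seniority")

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(seniority)

# classes_ is sorted, and position i holds the label for integer code i
names = list(label_encoder.classes_)

one_hot_encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array
dense = one_hot_encoder.fit_transform(codes.reshape(-1, 1)).toarray()

one_hot = pd.DataFrame(dense, columns=names)
print(one_hot)
```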

Or, if the feature names don't matter:

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data['seniority'] = label_encoder.fit_transform(data['seniority'])

from sklearn.preprocessing import OneHotEncoder
# Note: scikit-learn 1.2 renamed the sparse parameter to sparse_output
one_hot_encoder = OneHotEncoder(sparse=False)
data = pd.concat((data, pd.DataFrame(one_hot_encoder.fit_transform(data['seniority'].values.reshape(-1, 1)))), axis=1)

>>> data
   age  seniority  gender  salary    0    1
0    1          0    male       5  1.0  0.0
1    2          1  female       6  0.0  1.0
2    3          0  female       7  1.0  0.0
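
For completeness, scikit-learn itself offers a way to encode only selected columns: ColumnTransformer, which applies a transformer to the named columns and can pass the rest through unchanged. The sketch below assumes scikit-learn 0.20 or later (where OneHotEncoder accepts strings directly, so no LabelEncoder step is needed) and uses the same example frame; the transformer name "seniority" and the way the column names are rebuilt are illustrative choices, not the only option:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    "age": [1, 2, 3],
    "seniority": ["junior", "senior", "junior"],
    "gender": ["male", "female", "female"],
    "salary": [5, 6, 7],
})

transformer = ColumnTransformer(
    transformers=[("seniority", OneHotEncoder(), ["seniority"])],
    remainder="passthrough",  # keep age, gender and salary as they are
    sparse_threshold=0,       # always return a dense array
)
result = transformer.fit_transform(data)

# Encoded columns come first, followed by the passed-through columns
encoded_names = list(transformer.named_transformers_["seniority"].categories_[0])
remaining = [c for c in data.columns if c != "seniority"]
result_df = pd.DataFrame(result, columns=encoded_names + remaining)
print(result_df)
```

Because gender is passed through as strings, the resulting array has object dtype, so the numeric columns may need converting back before modelling.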

But in the end, pd.get_dummies does the job in a much nicer way (IMO).


answered Sep 21 '22 15:09

sacuL
