Sklearn Label Encoding multiple columns pandas dataframe

I am trying to encode a number of columns containing categorical data ("Yes" and "No") in a large pandas dataframe. The complete dataframe contains over 400 columns, so I am looking for a way to encode all the desired columns without having to encode them one by one. I use scikit-learn's LabelEncoder to encode the categorical data.

The first part of the dataframe does not have to be encoded. However, I am looking for a method to encode all the desired columns containing categorical data directly, without splitting and concatenating the dataframe.

To demonstrate my question, I first tried to solve it on a small part of the dataframe. However, I get stuck at the last part, where the data is fitted and transformed, and get a ValueError: bad input shape (4, 3). The code as I ran it:

# Create a simple dataframe resembling the large dataframe
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})


# Import required module
from sklearn.preprocessing import LabelEncoder

# Create an object of the label encoder class
labelencoder = LabelEncoder()

# Apply labelencoder object on columns
labelencoder.fit_transform(data.ix[:, 1:])   # First column does not need to be encoded

Complete error report:

labelencoder.fit_transform(data.ix[:, 1:])
Traceback (most recent call last):

  File "<ipython-input-47-b4986a719976>", line 1, in <module>
    labelencoder.fit_transform(data.ix[:, 1:])

  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 129, in fit_transform
    y = column_or_1d(y, warn=True)

  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))

ValueError: bad input shape (4, 3)

Does anyone know how to do this?

HelloBlob asked Jun 10 '17 14:06


People also ask

How do you label encode multiple columns together?

Instead of LabelEncoder, you can use OrdinalEncoder from scikit-learn, which supports multi-column encoding. It encodes categorical features as an integer array; the input should be an array-like of integers or strings denoting the values taken by the categorical (discrete) features.
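A minimal sketch of the OrdinalEncoder approach, applied to the sample dataframe from the question. The column selection `cols` and the choice of `dtype=int` are assumptions made for illustration; by default OrdinalEncoder returns floats.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample dataframe from the question
data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})

# Hypothetical selection: only these columns are encoded, 'A' stays untouched
cols = ['B', 'C', 'D']

# dtype=int is an illustrative choice; the default output dtype is float64
encoder = OrdinalEncoder(dtype=int)

encoded = data.copy()
encoded[cols] = encoder.fit_transform(data[cols])

print(encoded)

# One array of categories per encoded column, in the order of `cols`
print([c.tolist() for c in encoder.categories_])
```

Because a single OrdinalEncoder stores the categories of every fitted column, `encoder.inverse_transform(encoded[cols])` should recover the original labels for all of them at once.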

What is the difference between LabelEncoder and OrdinalEncoder?

OrdinalEncoder is meant for encoding input features, while LabelEncoder is meant for encoding the target variable.

What is the LabelEncoder() class?

LabelEncoder can be used to normalize labels. It can also transform non-numerical labels (as long as they are hashable and comparable) into numerical labels. Its fit method fits the encoder, and fit_transform fits the encoder and returns the encoded labels.

What is the difference between OneHotEncoder and LabelEncoder?

One-hot encoding creates one new column of 1s and 0s per category, whereas label encoding maps each category to a single integer within one column.


2 Answers

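As the following code shows, you can encode multiple columns by applying LabelEncoder column by column with DataFrame.apply. Note that this encodes every column, including A; select the columns first (for example df[['B', 'C', 'D']]) if some should stay untouched. Also note that, because the same encoder is refit on each column, its classes_ attribute only holds the classes of the last column, so the class information for the other columns is lost.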

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})
print(df)
#    A    B    C    D
# 0  1  Yes  Yes   No
# 1  2   No   No  Yes
# 2  3  Yes   No   No
# 3  4  Yes  Yes  Yes

# LabelEncoder
le = LabelEncoder()

# apply "le.fit_transform"
df_encoded = df.apply(le.fit_transform)
print(df_encoded)
#    A  B  C  D
# 0  0  1  1  0
# 1  1  0  0  1
# 2  2  1  0  0
# 3  3  1  1  1

# Note: le is refit on every column, so classes_ only reflects the last column ('D').
print(le.classes_)
# ['No' 'Yes']
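
If the class information is needed for every column, one option is to keep a separate encoder per column. The following is a sketch under that assumption, not part of the original answer; the names `cols_to_encode` and `encoders` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})

# Hypothetical selection of columns to encode; 'A' is left as-is
cols_to_encode = ['B', 'C', 'D']

# One fitted LabelEncoder per column, so no class information is overwritten
encoders = {}
df_encoded = df.copy()
for col in cols_to_encode:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df[col])
    encoders[col] = le

print(df_encoded)

# The classes of any column can be looked up afterwards
print(encoders['B'].classes_.tolist())

# Each encoder can also map its column back to the original labels
restored = encoders['B'].inverse_transform(df_encoded['B'])
print(restored.tolist())
```

Keeping the encoders in a dictionary keyed by column name also makes it possible to apply the same mapping to new data later with `encoders[col].transform(...)`.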
Keiku answered Sep 28 '22 05:09


First, find all the features with dtype object:

objList = df.select_dtypes(include="object").columns
print(objList)

Now, to convert the features in objList to a numeric type, you can use a for loop as shown below:

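# Label encoding for object-to-numeric conversion
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for feat in objList:
    # Cast to str so every value in the column has a uniform type
    df[feat] = le.fit_transform(df[feat].astype(str))

# info() prints its summary directly and returns None
df.info()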

Note that the for loop explicitly casts each column to string with astype(str); without the cast, fit_transform can raise a TypeError when a column holds mixed types (for example, strings alongside numbers).
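
As a small illustration of what the string cast does, the sketch below encodes a hypothetical column that mixes strings with an integer. The series `mixed` is invented for this example and is not from the original answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column mixing strings with an integer value
mixed = pd.Series(["Yes", "No", 1, "Yes"], name="flag")

le = LabelEncoder()

# After astype(str) every value is a string, so the values can be sorted and encoded
codes = le.fit_transform(mixed.astype(str))

print(le.classes_.tolist())
print(codes.tolist())
```

One consequence of this design is that the integer 1 becomes the string "1" and is treated as just another category, so the cast is best reserved for columns that are meant to be categorical.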

Darshan Jain answered Sep 28 '22 05:09